Hello,

Another query regarding 1805. I am observing that the zookeeper rolling upgrade always succeeds when I apply only the 1805 patch. When I apply both the 1810 and 1805 patches, the rolling upgrade fails due to the issue mentioned earlier.
Please advise whether it is fine to use only patch 1805 for the trunk.

Thanks & Regards,
Deepak

On Mon, Mar 10, 2014 at 3:11 PM, Deepak Jagtap <[email protected]> wrote:
> Hi German,
>
> I have applied patches 1810 and 1805 against trunk revision 1574686 (the most recent revision against which the 1810 patch build succeeded).
> But I am observing the following error in the zookeeper log on the new node joining the quorum:
>
> 2014-03-10 21:11:25,126 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@195] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2014-03-10 21:11:25,127 [myid:1] - INFO [/169.254.44.1:3888:QuorumCnxManager$Listener@540] - Received connection request /169.254.44.3:51507
> 2014-03-10 21:11:25,193 [myid:1] - ERROR [WorkerReceiver[myid=1]:NIOServerCnxnFactory$1@92] - Thread Thread[WorkerReceiver[myid=1],5,main] died
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:273)
>         at java.lang.Thread.run(Unknown Source)
>
> Followed by these messages getting printed repeatedly:
> 2014-03-10 21:11:25,328 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 400
> 2014-03-10 21:11:25,729 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 800
> 2014-03-10 21:11:26,530 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 1600
> 2014-03-10 21:11:28,131 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 3200
> 2014-03-10 21:11:31,332 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 6400
>
> Thanks & Regards,
> Deepak
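(Side note on the log above: the repeated "Notification time out" lines look like the election notification timeout simply doubling on each retry until it reaches a cap, which would match the 400, 800, ..., 6400 values above and the 51200 and 60000 values in the older messages below. A rough sketch of that backoff pattern follows; this only illustrates what I assume is happening, with an assumed 60-second cap, and is not the actual FastLeaderElection code.)

    public class NotificationBackoffSketch {
        public static void main(String[] args) {
            final int initialTimeoutMs = 400;  // first value seen in the log above
            final int maxIntervalMs = 60000;   // assumed cap; matches the 60000 seen later in the thread
            int timeout = initialTimeoutMs;
            // Each failed wait for a notification doubles the timeout until the cap is hit,
            // after which the server just keeps retrying at the capped value.
            while (timeout < maxIntervalMs) {
                System.out.println("Notification time out: " + timeout);
                timeout = Math.min(timeout * 2, maxIntervalMs);
            }
            System.out.println("Notification time out: " + timeout + " (repeats from here on)");
        }
    }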
> On Wed, Mar 5, 2014 at 11:50 AM, Deepak Jagtap <[email protected]> wrote:
>> Hi,
>>
>> I have applied only the 1805 patch, not 1810.
>> And the upgrade is from 3.5.0.1458648 to 3.5.0.1562289 (not from 3.4.5).
>> It was failing very consistently in our environment, and after the 1805 patch it went smoothly.
>>
>> Regards,
>> Deepak
>>
>> On Wed, Mar 5, 2014 at 7:36 AM, German Blanco <[email protected]> wrote:
>>> Hello,
>>>
>>> do you mean the ZOOKEEPER-1810 patch?
>>> That one alone doesn't solve the problem. On the other hand, the problem doesn't always happen, so after a rolling start it might get solved.
>>> We need 1818 as well, but it is easier to go step by step and get 1810 into trunk first.
>>> I hope that as soon as 3.4.6 is out this might get some attention.
>>>
>>> Regards,
>>>
>>> German.
>>>
>>> On Wed, Mar 5, 2014 at 2:17 AM, Deepak Jagtap <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > Please ignore the previous comment; I used the wrong jar file and hence the rolling upgrade failed.
>>> > After applying the patch for this bug on the zookeeper-3.5.0.1562289 revision, the rolling upgrade went fine.
>>> >
>>> > I have patched our in-house zookeeper version, but it would be convenient if we could apply the patch on trunk and use the latest trunk.
>>> > Please advise if I can apply the patch on the trunk and test it for you.
>>> >
>>> > Thanks & Regards,
>>> > Deepak
>>> >
>>> > On Tue, Mar 4, 2014 at 12:09 PM, Deepak Jagtap <[email protected]> wrote:
>>> > > Hi German,
>>> > >
>>> > > I tried applying the patch for 1805 but the problem still persists.
>>> > > Following are the notification messages logged repeatedly by the node which fails to join the quorum:
>>> > >
>>> > > 2014-03-04 20:00:54,398 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] - Notification time out: 51200
>>> > > 2014-03-04 20:00:54,400 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > > 2014-03-04 20:00:54,401 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3 (n.leader), 0x100003e84 (n.zxid), 0x2 (n.round), FOLLOWING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > > 2014-03-04 20:00:54,403 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3 (n.leader), 0x100003e84 (n.zxid), 0xffffffffffffffff (n.round), LEADING (n.state), 3 (n.sid), 0x2 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > >
>>> > > The patch for 1732 is already included in the trunk.
>>> > >
>>> > > Thanks & Regards,
>>> > > Deepak
>>> > >
>>> > > On Fri, Feb 28, 2014 at 2:58 PM, Deepak Jagtap <[email protected]> wrote:
>>> > >> Hi Flavio, German,
>>> > >>
>>> > >> Since this fix is critical for the zookeeper rolling upgrade, is it ok if I apply this patch to the 3.5.0 trunk?
>>> > >> Is it straightforward to apply this patch to trunk?
>>> > >>
>>> > >> Thanks & Regards,
>>> > >> Deepak
>>> > >>
>>> > >> On Wed, Feb 26, 2014 at 11:46 AM, Deepak Jagtap <[email protected]> wrote:
>>> > >>> Thanks German!
>>> > >>> Just wondering, is there any chance that this patch may be applied to trunk in the near future?
>>> > >>> If it's fine with you guys, I would be more than happy to apply the fixes (from 3.4.5) to trunk and test them.
>>> > >>>
>>> > >>> Thanks & Regards,
>>> > >>> Deepak
>>> > >>>
>>> > >>> On Wed, Feb 26, 2014 at 1:29 AM, German Blanco <[email protected]> wrote:
>>> > >>>> Hello Deepak,
>>> > >>>>
>>> > >>>> due to ZOOKEEPER-1732 and then ZOOKEEPER-1805, there are some cases in which an ensemble can be formed so that it doesn't allow any other zookeeper server to join.
>>> > >>>> This has been fixed in branch 3.4, but it hasn't been fixed in trunk yet.
>>> > >>>> Check if the Notifications sent around contain different values for the vote in the members of the ensemble.
>>> > >>>> If you force a new election (e.g. by killing the leader) I guess everything should work normally, but don't take my word for it.
>>> > >>>> Flavio should know more about this.
>>> > >>>>
>>> > >>>> Cheers,
>>> > >>>>
>>> > >>>> German.
>>> > >>>>
>>> > >>>> On Wed, Feb 26, 2014 at 4:04 AM, Deepak Jagtap <[email protected]> wrote:
>>> > >>>> > Hi,
>>> > >>>> >
>>> > >>>> > I am replacing one of the zookeeper servers in a 3-node quorum.
>>> > >>>> > Initially all zookeeper servers were running version 3.5.0.1515976.
>>> > >>>> > I successfully replaced Node3 with the newer version 3.5.0.1551730.
>>> > >>>> > When I tried to replace Node2 with the same zookeeper version,
>>> > >>>> > I couldn't start the zookeeper server on Node2; it is continuously stuck in a leader election loop, printing the following messages:
>>> > >>>> >
>>> > >>>> > 2014-02-26 02:45:23,709 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] - Notification time out: 60000
>>> > >>>> > 2014-02-26 02:45:23,710 [myid:3] - INFO [WorkerSender[myid=3]:QuorumCnxManager@195] - Have smaller server identifier, so dropping the connection: (5, 3)
>>> > >>>> > 2014-02-26 02:45:23,712 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@605] - Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > >>>> >
>>> > >>>> > The network connections and configuration of the node being upgraded are fine.
>>> > >>>> > The other 2 nodes in the quorum are fine and serving requests.
>>> > >>>> >
>>> > >>>> > Any idea what might be causing this?
>>> > >>>> >
>>> > >>>> > Thanks & Regards,
>>> > >>>> > Deepak
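P.S. Regarding German's earlier suggestion to check whether the ensemble members carry different values for the vote: below is a rough sketch of one way to compare what each server reports about itself. It assumes the four-letter "srvr" command is reachable on each client port; the host and port values are placeholders, not a confirmed setup.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class EnsembleViewCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder client addresses for the three ensemble members.
            String[] servers = {"169.254.44.1:2181", "169.254.44.2:2181", "169.254.44.3:2181"};
            for (String server : servers) {
                String[] hostPort = server.split(":");
                try (Socket socket = new Socket(hostPort[0], Integer.parseInt(hostPort[1]))) {
                    // Send the four-letter "srvr" command and dump the reply.
                    socket.getOutputStream().write("srvr".getBytes(StandardCharsets.UTF_8));
                    socket.getOutputStream().flush();
                    socket.shutdownOutput();
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                    System.out.println("=== " + server + " ===");
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // The "Mode:" and "Zxid:" lines show whether the members agree
                        // on who is leading and how far each of them has gotten.
                        System.out.println(line);
                    }
                } catch (Exception e) {
                    System.out.println("=== " + server + " === unreachable: " + e.getMessage());
                }
            }
        }
    }

Comparing the "Mode:" lines of the two healthy nodes with the stuck node at least shows quickly whether the existing members still report leader/follower while the new node stays in leader election.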
