StandaloneDisabledTest.startSingleServerTest seems to be failing from the same issue. We should fix this soon.
https://issues.apache.org/jira/browse/ZOOKEEPER-1870

On Mon, Mar 10, 2014 at 5:33 PM, Deepak Jagtap <[email protected]> wrote:

> Hello,
>
> Another query regarding 1805.
> I am observing that the zookeeper rolling upgrade always succeeds when I
> apply the 1805 patch.
> When I apply both the 1810 and 1805 patches, the rolling upgrade fails due
> to the issue mentioned earlier.
>
> Please advise if it's fine to use only patch 1805 for the trunk?
>
> Thanks & Regards,
> Deepak
>
>
> On Mon, Mar 10, 2014 at 3:11 PM, Deepak Jagtap <[email protected]> wrote:
>
>> Hi German,
>>
>> I have applied patches 1810 and 1805 against trunk revision 1574686 (the
>> most recent revision against which the 1810 patch build succeeded).
>> But I am observing the following error in the zookeeper log on the new
>> node joining the quorum:
>>
>> 2014-03-10 21:11:25,126 [myid:1] - INFO
>> [WorkerSender[myid=1]:QuorumCnxManager@195] - Have smaller server
>> identifier, so dropping the connection: (3, 1)
>> 2014-03-10 21:11:25,127 [myid:1] - INFO
>> [/169.254.44.1:3888:QuorumCnxManager$Listener@540] - Received connection
>> request /169.254.44.3:51507
>> 2014-03-10 21:11:25,193 [myid:1] - ERROR
>> [WorkerReceiver[myid=1]:NIOServerCnxnFactory$1@92] - Thread
>> Thread[WorkerReceiver[myid=1],5,main] died
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:273)
>>     at java.lang.Thread.run(Unknown Source)
>>
>> Followed by these messages getting printed repeatedly:
>>
>> 2014-03-10 21:11:25,328 [myid:1] - INFO
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> Notification time out: 400
>> 2014-03-10 21:11:25,729 [myid:1] - INFO
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> Notification time out: 800
>> 2014-03-10 21:11:26,530 [myid:1] - INFO
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> Notification time out: 1600
>> 2014-03-10 21:11:28,131 [myid:1] - INFO
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> Notification time out: 3200
>> 2014-03-10 21:11:31,332 [myid:1] - INFO
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> Notification time out: 6400
>>
>> Thanks & Regards,
>> Deepak
>>
>>
>> On Wed, Mar 5, 2014 at 11:50 AM, Deepak Jagtap <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have applied only the 1805 patch, not 1810.
>>> And the upgrade is from 3.5.0.1458648 to 3.5.0.1562289 (not from 3.4.5).
>>> It was failing very consistently in our environment, and after the 1805
>>> patch it went smoothly.
>>>
>>> Regards,
>>> Deepak
>>>
>>>
>>> On Wed, Mar 5, 2014 at 7:36 AM, German Blanco
>>> <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> do you mean the ZOOKEEPER-1810 patch?
>>>> That one alone doesn't solve the problem. On the other hand, the problem
>>>> doesn't always happen, so after a rolling start it might get solved.
>>>> We need 1818 as well, but it is easier to go step by step and get 1810
>>>> into trunk first.
>>>> I hope that as soon as 3.4.6 is out this might get some attention.
>>>>
>>>> Regards,
>>>>
>>>> German.
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 2:17 AM, Deepak Jagtap <[email protected]> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > Please ignore the previous comment; I used the wrong jar file and
>>>> > hence the rolling upgrade failed.
>>>> > After applying the patch for the bug on the zookeeper-3.5.0.1562289
>>>> > revision, the rolling upgrade went fine.
>>>> >
>>>> > I have patched our in-house zookeeper version, but it would be
>>>> > convenient if we could apply the patch on trunk and use the latest
>>>> > trunk.
>>>> > Please advise if I can apply the patch on the trunk and test it for
>>>> > you.
>>>> >
>>>> > Thanks & Regards,
>>>> > Deepak
>>>> >
>>>> >
>>>> > On Tue, Mar 4, 2014 at 12:09 PM, Deepak Jagtap
>>>> > <[email protected]> wrote:
>>>> >
>>>> > > Hi German,
>>>> > >
>>>> > > I tried applying the patch for 1805 but the problem still persists.
>>>> > > Following are the notification messages logged repeatedly by the
>>>> > > node which fails to join the quorum:
>>>> > >
>>>> > > 2014-03-04 20:00:54,398 [myid:2] - INFO
>>>> > > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] -
>>>> > > Notification time out: 51200
>>>> > > 2014-03-04 20:00:54,400 [myid:2] - INFO
>>>> > > [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 2
>>>> > > (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2
>>>> > > (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>>> > > 2014-03-04 20:00:54,401 [myid:2] - INFO
>>>> > > [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3
>>>> > > (n.leader), 0x100003e84 (n.zxid), 0x2 (n.round), FOLLOWING
>>>> > > (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)1
>>>> > > (n.config version)
>>>> > > 2014-03-04 20:00:54,403 [myid:2] - INFO
>>>> > > [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3
>>>> > > (n.leader), 0x100003e84 (n.zxid), 0xffffffffffffffff (n.round),
>>>> > > LEADING (n.state), 3 (n.sid), 0x2 (n.peerEPoch), LOOKING (my state)1
>>>> > > (n.config version)
>>>> > >
>>>> > > The patch for 1732 is already included in the trunk.
>>>> > >
>>>> > > Thanks & Regards,
>>>> > > Deepak
>>>> > >
>>>> > >
>>>> > > On Fri, Feb 28, 2014 at 2:58 PM, Deepak Jagtap
>>>> > > <[email protected]> wrote:
>>>> > >
>>>> > >> Hi Flavio, German,
>>>> > >>
>>>> > >> Since this fix is critical for the zookeeper rolling upgrade, is it
>>>> > >> ok if I apply this patch to the 3.5.0 trunk?
>>>> > >> Is it straightforward to apply this patch to trunk?
>>>> > >>
>>>> > >> Thanks & Regards,
>>>> > >> Deepak
>>>> > >>
>>>> > >>
>>>> > >> On Wed, Feb 26, 2014 at 11:46 AM, Deepak Jagtap
>>>> > >> <[email protected]> wrote:
>>>> > >>
>>>> > >>> Thanks German!
>>>> > >>> Just wondering, is there any chance that this patch may be
>>>> > >>> applied to trunk in the near future?
>>>> > >>> If it's fine with you guys, I would be more than happy to apply
>>>> > >>> the fixes (from 3.4.5) to trunk and test them.
>>>> > >>>
>>>> > >>> Thanks & Regards,
>>>> > >>> Deepak
>>>> > >>>
>>>> > >>>
>>>> > >>> On Wed, Feb 26, 2014 at 1:29 AM, German Blanco
>>>> > >>> <[email protected]> wrote:
>>>> > >>>
>>>> > >>>> Hello Deepak,
>>>> > >>>>
>>>> > >>>> due to ZOOKEEPER-1732 and then ZOOKEEPER-1805, there are some
>>>> > >>>> cases in which an ensemble can be formed so that it doesn't
>>>> > >>>> allow any other zookeeper server to join.
>>>> > >>>> This has been fixed in branch 3.4, but it hasn't been fixed in
>>>> > >>>> trunk yet.
>>>> > >>>> Check if the Notifications sent around contain different values
>>>> > >>>> for the vote in the members of the ensemble.
>>>> > >>>> If you force a new election (e.g. by killing the leader) I guess
>>>> > >>>> everything should work normally, but don't take my word for it.
>>>> > >>>> Flavio should know more about this.
>>>> > >>>>
>>>> > >>>> Cheers,
>>>> > >>>>
>>>> > >>>> German.
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> On Wed, Feb 26, 2014 at 4:04 AM, Deepak Jagtap
>>>> > >>>> <[email protected]> wrote:
>>>> > >>>>
>>>> > >>>> > Hi,
>>>> > >>>> >
>>>> > >>>> > I am replacing one of the zookeeper servers in a 3-node
>>>> > >>>> > quorum.
>>>> > >>>> > Initially all zookeeper servers were running version
>>>> > >>>> > 3.5.0.1515976.
>>>> > >>>> > I successfully replaced Node3 with the newer version
>>>> > >>>> > 3.5.0.1551730.
>>>> > >>>> > When I tried to replace Node2 with the same zookeeper version,
>>>> > >>>> > I couldn't start the zookeeper server on Node2, as it is
>>>> > >>>> > continuously stuck in a leader election loop, printing the
>>>> > >>>> > following messages:
>>>> > >>>> >
>>>> > >>>> > 2014-02-26 02:45:23,709 [myid:3] - INFO
>>>> > >>>> > [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837]
>>>> > >>>> > - Notification time out: 60000
>>>> > >>>> > 2014-02-26 02:45:23,710 [myid:3] - INFO
>>>> > >>>> > [WorkerSender[myid=3]:QuorumCnxManager@195] - Have smaller
>>>> > >>>> > server identifier, so dropping the connection: (5, 3)
>>>> > >>>> > 2014-02-26 02:45:23,712 [myid:3] - INFO
>>>> > >>>> > [WorkerReceiver[myid=3]:FastLeaderElection@605] -
>>>> > >>>> > Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round),
>>>> > >>>> > LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING
>>>> > >>>> > (my state)1 (n.config version)
>>>> > >>>> >
>>>> > >>>> > Network connections and configuration of the node being
>>>> > >>>> > upgraded are fine.
>>>> > >>>> > The other 2 nodes in the quorum are fine and serving requests.
>>>> > >>>> >
>>>> > >>>> > Any idea what might be causing this?
>>>> > >>>> >
>>>> > >>>> > Thanks & Regards,
>>>> > >>>> > Deepak
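For context on the repeated "Notification time out" lines quoted throughout the thread: the election's receive loop doubles its poll timeout after each round that produces no usable notification, capped at 60000 ms, which matches the 400 → 800 → … → 51200 → 60000 progression seen in the logs. A minimal sketch of that backoff, with illustrative names and an assumed 200 ms starting value (not ZooKeeper's actual identifiers):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the exponential backoff behind the "Notification time out"
// log lines: each time the receive poll comes back empty, the wait is
// doubled, but never past a fixed cap. Names and the starting value are
// illustrative assumptions.
public class ElectionBackoffSketch {

    // Returns the successive timeout values that would be logged after
    // each failed poll, starting from `initial` and capped at `cap`.
    static List<Integer> backoffSequence(int initial, int cap, int rounds) {
        List<Integer> logged = new ArrayList<>();
        int timeout = initial;
        for (int i = 0; i < rounds; i++) {
            timeout = Math.min(timeout * 2, cap); // double, capped
            logged.add(timeout);
        }
        return logged;
    }

    public static void main(String[] args) {
        // With an assumed 200 ms start and a 60000 ms cap, this yields the
        // values seen in the logs: 400, 800, 1600, ..., 51200, 60000.
        for (int t : backoffSequence(200, 60000, 10)) {
            System.out.println("Notification time out: " + t);
        }
    }
}
```

The cap explains why a node stuck in election, like Node2 above, settles into printing "Notification time out: 60000" once per minute.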
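One plausible reading of the OutOfMemoryError in WorkerReceiver quoted above, and of the class of mixed-version problems the 1805/1810-era patches address, is a receiver interpreting bytes from a peer on a different wire format as a message length and allocating a buffer that large. A defensive-read sketch of the general technique; the class name, method, and cap are illustrative assumptions, not ZooKeeper's actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of a defensive read for a length-prefixed message: validate the
// length before allocating, so garbage from a protocol mismatch raises an
// IOException instead of an OutOfMemoryError. All names and the cap are
// illustrative assumptions.
public class SafeMessageRead {
    static final int MAX_MSG_LEN = 512 * 1024; // assumed sanity cap

    static byte[] readMessage(DataInputStream in) throws IOException {
        int len = in.readInt();
        if (len <= 0 || len > MAX_MSG_LEN) {
            // Reject instead of allocating a bogus-sized buffer.
            throw new IOException("unreasonable message length: " + len);
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // A well-formed message: 4-byte length prefix 3, then the payload.
        byte[] wire = {0, 0, 0, 3, 'a', 'b', 'c'};
        byte[] msg = readMessage(
                new DataInputStream(new ByteArrayInputStream(wire)));
        System.out.println(new String(msg)); // prints "abc"

        // A bogus length prefix (0x7fffffff) is rejected up front.
        byte[] bad = {0x7f, (byte) 0xff, (byte) 0xff, (byte) 0xff};
        try {
            readMessage(new DataInputStream(new ByteArrayInputStream(bad)));
        } catch (IOException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Without such a check, a single malformed frame kills the receiver thread, which fits the "Thread ... died" log line followed by the endless notification timeouts.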
