Hello Michi,

I observed the following while testing the patch for 1805 against trunk revision 1574686. I ran "ant -Djavac.args="-Xlint -Xmaxwarns 1000" clean test tar" against trunk revision 1574686, and the build failed because StandaloneDisabledTest failed.
After applying 1805 against 1574686, the build failed with the following tests failing:

1. StandaloneDisabledTest
2. QuorumTest

When I run only QuorumTest against this (1574686 + the 1805 patch), it succeeds (using "ant -Dtestcase=QuorumTest test").

Please advise: should I assume the build is successful except for StandaloneDisabledTest?

Thanks & Regards,
Deepak

On Mon, Mar 10, 2014 at 6:11 PM, Deepak Jagtap <[email protected]> wrote:

> Thanks Michi!
>
>
> On Mon, Mar 10, 2014 at 5:40 PM, Michi Mutsuzaki <[email protected]> wrote:
>
>> StandaloneDisabledTest.startSingleServerTest seems to be failing from
>> the same issue. We should fix this soon.
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1870
>>
>> On Mon, Mar 10, 2014 at 5:33 PM, Deepak Jagtap <[email protected]> wrote:
>> > Hello,
>> >
>> > Another query regarding 1805.
>> > I am observing that the zookeeper rolling upgrade always succeeds when I
>> > apply only the 1805 patch.
>> > When I apply both the 1810 and 1805 patches, the rolling upgrade fails
>> > due to the issue mentioned earlier.
>> >
>> > Please advise if it's fine to use only patch 1805 for the trunk.
>> >
>> > Thanks & Regards,
>> > Deepak
>> >
>> >
>> > On Mon, Mar 10, 2014 at 3:11 PM, Deepak Jagtap <[email protected]> wrote:
>> >
>> >> Hi German,
>> >>
>> >> I have applied patches 1810 and 1805 against trunk revision 1574686
>> >> (the recent revision against which the 1810 patch build succeeded).
>> >> But I am observing the following error in the zookeeper log on the new
>> >> node joining the quorum:
>> >>
>> >> 2014-03-10 21:11:25,126 [myid:1] - INFO
>> >> [WorkerSender[myid=1]:QuorumCnxManager@195] - Have smaller server
>> >> identifier, so dropping the connection: (3, 1)
>> >> 2014-03-10 21:11:25,127 [myid:1] - INFO [/169.254.44.1:3888
>> >> :QuorumCnxManager$Listener@540] - Received connection request
>> >> /169.254.44.3:51507
>> >> 2014-03-10 21:11:25,193 [myid:1] - ERROR
>> >> [WorkerReceiver[myid=1]:NIOServerCnxnFactory$1@92] - Thread
>> >> Thread[WorkerReceiver[myid=1],5,main] died
>> >> java.lang.OutOfMemoryError: Java heap space
>> >>     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:273)
>> >>     at java.lang.Thread.run(Unknown Source)
>> >>
>> >> Followed by these messages getting printed repeatedly:
>> >>
>> >> 2014-03-10 21:11:25,328 [myid:1] - INFO
>> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> >> Notification time out: 400
>> >> 2014-03-10 21:11:25,729 [myid:1] - INFO
>> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> >> Notification time out: 800
>> >> 2014-03-10 21:11:26,530 [myid:1] - INFO
>> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> >> Notification time out: 1600
>> >> 2014-03-10 21:11:28,131 [myid:1] - INFO
>> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> >> Notification time out: 3200
>> >> 2014-03-10 21:11:31,332 [myid:1] - INFO
>> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] -
>> >> Notification time out: 6400
>> >>
>> >> Thanks & Regards,
>> >> Deepak
>> >>
>> >>
>> >> On Wed, Mar 5, 2014 at 11:50 AM, Deepak Jagtap <[email protected]> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I have applied only the 1805 patch, not 1810.
>> >>> And the upgrade is from 3.5.0.1458648 to 3.5.0.1562289 (not from 3.4.5).
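The "Notification time out" values in the log above double on each retry (400, 800, 1600, 3200, 6400, ...) until they reach a cap; that is the election's receive-timeout backoff, which a node cycles through as long as it gets no usable notification. A minimal sketch of that doubling, with assumed class and constant names (not ZooKeeper's actual code):

```java
// Hypothetical sketch of the notification-timeout backoff seen in the
// logs above; names and constants are illustrative assumptions.
public class NotificationBackoff {
    static final int INITIAL_TIMEOUT_MS = 200;  // assumed starting wait
    static final int MAX_TIMEOUT_MS = 60000;    // cap seen later in this thread

    // Double the wait after each poll that times out, up to the cap.
    static int nextTimeout(int currentMs) {
        return Math.min(currentMs * 2, MAX_TIMEOUT_MS);
    }

    public static void main(String[] args) {
        int t = INITIAL_TIMEOUT_MS;
        for (int i = 0; i < 5; i++) {
            t = nextTimeout(t);
            // Mirrors the 400, 800, 1600, 3200, 6400 sequence from the log.
            System.out.println("Notification time out: " + t);
        }
    }
}
```

Under this model, a server that never receives a valid vote keeps backing off until it sits at the 60000 ms cap, which matches the repeated timeout lines reported elsewhere in this thread.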
>> >>> It was failing very consistently in our environment, and after the
>> >>> 1805 patch it went smoothly.
>> >>>
>> >>> Regards,
>> >>> Deepak
>> >>>
>> >>>
>> >>> On Wed, Mar 5, 2014 at 7:36 AM, German Blanco <[email protected]> wrote:
>> >>>
>> >>>> Hello,
>> >>>>
>> >>>> Do you mean the ZOOKEEPER-1810 patch?
>> >>>> That one alone doesn't solve the problem. On the other hand, the
>> >>>> problem doesn't always happen, so after a rolling start it might get
>> >>>> solved.
>> >>>> We need 1818 as well, but it is easier to go step by step and get
>> >>>> 1810 into trunk first.
>> >>>> I hope that as soon as 3.4.6 is out this might get some attention.
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> German.
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 5, 2014 at 2:17 AM, Deepak Jagtap <[email protected]> wrote:
>> >>>>
>> >>>> > Hi,
>> >>>> >
>> >>>> > Please ignore the previous comment; I used the wrong jar file and
>> >>>> > hence the rolling upgrade failed.
>> >>>> > After applying the patch for the bug on the zookeeper-3.5.0.1562289
>> >>>> > revision, the rolling upgrade went fine.
>> >>>> >
>> >>>> > I have patched our in-house zookeeper version, but it would be
>> >>>> > convenient if we could apply the patch on trunk and use the latest
>> >>>> > trunk.
>> >>>> > Please advise if I can apply the patch on the trunk and test it
>> >>>> > for you.
>> >>>> >
>> >>>> > Thanks & Regards,
>> >>>> > Deepak
>> >>>> >
>> >>>> >
>> >>>> > On Tue, Mar 4, 2014 at 12:09 PM, Deepak Jagtap <[email protected]> wrote:
>> >>>> >
>> >>>> > > Hi German,
>> >>>> > >
>> >>>> > > I tried applying the patch for 1805, but the problem still persists.
>> >>>> > > The following notification messages are logged repeatedly by the
>> >>>> > > node which fails to join the quorum:
>> >>>> > >
>> >>>> > > 2014-03-04 20:00:54,398 [myid:2] - INFO
>> >>>> > > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] -
>> >>>> > > Notification time out: 51200
>> >>>> > > 2014-03-04 20:00:54,400 [myid:2] - INFO
>> >>>> > > [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 2
>> >>>> > > (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2
>> >>>> > > (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>> >>>> > > 2014-03-04 20:00:54,401 [myid:2] - INFO
>> >>>> > > [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3
>> >>>> > > (n.leader), 0x100003e84 (n.zxid), 0x2 (n.round), FOLLOWING
>> >>>> > > (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)1
>> >>>> > > (n.config version)
>> >>>> > > 2014-03-04 20:00:54,403 [myid:2] - INFO
>> >>>> > > [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3
>> >>>> > > (n.leader), 0x100003e84 (n.zxid), 0xffffffffffffffff (n.round),
>> >>>> > > LEADING (n.state), 3 (n.sid), 0x2 (n.peerEPoch), LOOKING (my
>> >>>> > > state)1 (n.config version)
>> >>>> > >
>> >>>> > > The patch for 1732 is already included in the trunk.
>> >>>> > >
>> >>>> > > Thanks & Regards,
>> >>>> > > Deepak
>> >>>> > >
>> >>>> > >
>> >>>> > > On Fri, Feb 28, 2014 at 2:58 PM, Deepak Jagtap <[email protected]> wrote:
>> >>>> > >
>> >>>> > >> Hi Flavio, German,
>> >>>> > >>
>> >>>> > >> Since this fix is critical for zookeeper rolling upgrade, is it
>> >>>> > >> ok if I apply this patch to the 3.5.0 trunk?
>> >>>> > >> Is it straightforward to apply this patch to trunk?
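The Notification lines above carry the fields each peer compares when deciding whether to adopt another server's vote. A hedged sketch of the usual ordering (election epoch first, then zxid, then server id), simplified from what FastLeaderElection's vote comparison does; the method name and signature here are illustrative, not the trunk code:

```java
// Simplified illustration of vote ordering in leader election:
// a proposed vote wins if it has a higher epoch, or the same epoch
// and a higher zxid, or the same epoch and zxid and a higher sid.
public class VoteOrder {
    static boolean proposedWins(long newId, long newZxid, long newEpoch,
                                long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
            || (newEpoch == curEpoch && newZxid > curZxid)
            || (newEpoch == curEpoch && newZxid == curZxid && newId > curId);
    }

    public static void main(String[] args) {
        // A vote for server 3 with zxid 0x100003e84 (as in the log above)
        // beats a zero-zxid vote for server 2 at the same epoch.
        System.out.println(proposedWins(3, 0x100003e84L, 0x1, 2, 0x0, 0x1));
    }
}
```

Under this ordering, every healthy member should converge on the same vote; German's suggestion to check whether members are sending different vote values is a way to spot the ensemble failing to converge.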
>> >>>> > >>
>> >>>> > >> Thanks & Regards,
>> >>>> > >> Deepak
>> >>>> > >>
>> >>>> > >>
>> >>>> > >> On Wed, Feb 26, 2014 at 11:46 AM, Deepak Jagtap <[email protected]> wrote:
>> >>>> > >>
>> >>>> > >>> Thanks German!
>> >>>> > >>> Just wondering, is there any chance that this patch may be
>> >>>> > >>> applied to trunk in the near future?
>> >>>> > >>> If it's fine with you guys, I would be more than happy to apply
>> >>>> > >>> the fixes (from 3.4.5) to trunk and test them.
>> >>>> > >>>
>> >>>> > >>> Thanks & Regards,
>> >>>> > >>> Deepak
>> >>>> > >>>
>> >>>> > >>>
>> >>>> > >>> On Wed, Feb 26, 2014 at 1:29 AM, German Blanco <[email protected]> wrote:
>> >>>> > >>>
>> >>>> > >>>> Hello Deepak,
>> >>>> > >>>>
>> >>>> > >>>> Due to ZOOKEEPER-1732 and then ZOOKEEPER-1805, there are some
>> >>>> > >>>> cases in which an ensemble can be formed so that it doesn't
>> >>>> > >>>> allow any other zookeeper server to join.
>> >>>> > >>>> This has been fixed in branch 3.4, but it hasn't been fixed in
>> >>>> > >>>> trunk yet.
>> >>>> > >>>> Check if the Notifications sent around contain different
>> >>>> > >>>> values for the vote in the members of the ensemble.
>> >>>> > >>>> If you force a new election (e.g. by killing the leader), I
>> >>>> > >>>> guess everything should work normally, but don't take my word
>> >>>> > >>>> for it.
>> >>>> > >>>> Flavio should know more about this.
>> >>>> > >>>>
>> >>>> > >>>> Cheers,
>> >>>> > >>>>
>> >>>> > >>>> German.
>> >>>> > >>>>
>> >>>> > >>>>
>> >>>> > >>>> On Wed, Feb 26, 2014 at 4:04 AM, Deepak Jagtap <[email protected]> wrote:
>> >>>> > >>>>
>> >>>> > >>>> > Hi,
>> >>>> > >>>> >
>> >>>> > >>>> > I am replacing one of the zookeeper servers in a 3 node
>> >>>> > >>>> > quorum.
>> >>>> > >>>> > Initially all zookeeper servers were running the
>> >>>> > >>>> > 3.5.0.1515976 version.
>> >>>> > >>>> > I successfully replaced Node3 with the newer version
>> >>>> > >>>> > 3.5.0.1551730.
>> >>>> > >>>> > When I try to replace Node2 with the same zookeeper version,
>> >>>> > >>>> > I can't start the zookeeper server on Node2, as it is
>> >>>> > >>>> > continuously stuck in a leader election loop printing the
>> >>>> > >>>> > following messages:
>> >>>> > >>>> >
>> >>>> > >>>> > 2014-02-26 02:45:23,709 [myid:3] - INFO
>> >>>> > >>>> > [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] -
>> >>>> > >>>> > Notification time out: 60000
>> >>>> > >>>> > 2014-02-26 02:45:23,710 [myid:3] - INFO
>> >>>> > >>>> > [WorkerSender[myid=3]:QuorumCnxManager@195] - Have smaller
>> >>>> > >>>> > server identifier, so dropping the connection: (5, 3)
>> >>>> > >>>> > 2014-02-26 02:45:23,712 [myid:3] - INFO
>> >>>> > >>>> > [WorkerReceiver[myid=3]:FastLeaderElection@605] - Notification: 3
>> >>>> > >>>> > (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3
>> >>>> > >>>> > (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>> >>>> > >>>> >
>> >>>> > >>>> > The network connections and configuration of the node being
>> >>>> > >>>> > upgraded are fine.
>> >>>> > >>>> > The other 2 nodes in the quorum are fine and serving
>> >>>> > >>>> > requests.
>> >>>> > >>>> >
>> >>>> > >>>> > Any idea what might be causing this?
>> >>>> > >>>> >
>> >>>> > >>>> > Thanks & Regards,
>> >>>> > >>>> > Deepak
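The "Have smaller server identifier, so dropping the connection" lines that recur in the logs throughout this thread come from the election transport's tie-breaking rule: between any pair of servers only one connection survives, namely the one initiated by the server with the larger id. A minimal sketch of that rule (illustrative names, not the actual QuorumCnxManager code):

```java
// Illustrative sketch of the pairwise connection rule behind
// "Have smaller server identifier, so dropping the connection".
public class ConnectionRule {
    // A server keeps a connection it initiated only if its id is
    // larger than the remote server's id; otherwise it drops the
    // socket and waits for the higher-id peer to connect back.
    static boolean keepInitiatedConnection(long mySid, long remoteSid) {
        return mySid > remoteSid;
    }

    public static void main(String[] args) {
        // Matches the earlier log "(3, 1)" on myid:1 - server 1 connecting
        // out to server 3 drops its own connection because 1 < 3.
        System.out.println(keepInitiatedConnection(1, 3));
    }
}
```

These dropped-connection messages are therefore normal during election; on their own they do not explain a node being stuck, which is why the thread focuses on the vote contents instead.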
