Hello,

Another query regarding 1805. I am observing that the zookeeper rolling upgrade always succeeds when I apply only the 1805 patch. When I apply both the 1810 and 1805 patches, the rolling upgrade fails due to the issue mentioned earlier.
Please advise whether it is fine to use only patch 1805 for the trunk.

Thanks & Regards,
Deepak

On Mon, Mar 10, 2014 at 3:11 PM, Deepak Jagtap <[email protected]> wrote:
> Hi German,
>
> I have applied patches 1810 and 1805 against trunk revision 1574686 (the most recent revision against which the 1810 patch build succeeded).
> But I am observing the following error in the zookeeper log on the new node joining the quorum:
>
> 2014-03-10 21:11:25,126 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@195] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2014-03-10 21:11:25,127 [myid:1] - INFO [/169.254.44.1:3888:QuorumCnxManager$Listener@540] - Received connection request /169.254.44.3:51507
> 2014-03-10 21:11:25,193 [myid:1] - ERROR [WorkerReceiver[myid=1]:NIOServerCnxnFactory$1@92] - Thread Thread[WorkerReceiver[myid=1],5,main] died
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:273)
>         at java.lang.Thread.run(Unknown Source)
>
> Followed by these messages getting printed repeatedly:
> 2014-03-10 21:11:25,328 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 400
> 2014-03-10 21:11:25,729 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 800
> 2014-03-10 21:11:26,530 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 1600
> 2014-03-10 21:11:28,131 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 3200
> 2014-03-10 21:11:31,332 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@900] - Notification time out: 6400
>
> Thanks & Regards,
> Deepak
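(Side note on the log above: the repeated "Notification time out" lines look like the election notification timeout simply doubling on each retry until it reaches a cap, which would match the 400, 800, ..., 6400 values above and the 51200 and 60000 values in the older messages below. A rough sketch of that backoff pattern follows; this only illustrates what I assume is happening, with an assumed 60-second cap, and is not the actual FastLeaderElection code.)

    public class NotificationBackoffSketch {
        public static void main(String[] args) {
            final int initialTimeoutMs = 400;  // first value seen in the log above
            final int maxIntervalMs = 60000;   // assumed cap; matches the 60000 seen later in the thread
            int timeout = initialTimeoutMs;
            // Each failed wait for a notification doubles the timeout until the cap is hit,
            // after which the server just keeps retrying at the capped value.
            while (timeout < maxIntervalMs) {
                System.out.println("Notification time out: " + timeout);
                timeout = Math.min(timeout * 2, maxIntervalMs);
            }
            System.out.println("Notification time out: " + timeout + " (repeats from here on)");
        }
    }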
> On Wed, Mar 5, 2014 at 11:50 AM, Deepak Jagtap <[email protected]> wrote:
>> Hi,
>>
>> I have applied only the 1805 patch, not 1810.
>> And the upgrade is from 3.5.0.1458648 to 3.5.0.1562289 (not from 3.4.5).
>> It was failing very consistently in our environment, and after the 1805 patch it went smoothly.
>>
>> Regards,
>> Deepak
>>
>> On Wed, Mar 5, 2014 at 7:36 AM, German Blanco <[email protected]> wrote:
>>> Hello,
>>>
>>> do you mean the ZOOKEEPER-1810 patch?
>>> That one alone doesn't solve the problem. On the other hand, the problem doesn't always happen, so after a rolling start it might get solved.
>>> We need 1818 as well, but it is easier to go step by step and get 1810 into trunk first.
>>> I hope that as soon as 3.4.6 is out this might get some attention.
>>>
>>> Regards,
>>>
>>> German.
>>>
>>> On Wed, Mar 5, 2014 at 2:17 AM, Deepak Jagtap <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > Please ignore the previous comment; I used the wrong jar file and hence the rolling upgrade failed.
>>> > After applying the patch for this bug on the zookeeper-3.5.0.1562289 revision, the rolling upgrade went fine.
>>> >
>>> > I have patched our in-house zookeeper version, but it would be convenient if we could apply the patch on trunk and use the latest trunk.
>>> > Please advise if I can apply the patch on the trunk and test it for you.
>>> >
>>> > Thanks & Regards,
>>> > Deepak
>>> >
>>> > On Tue, Mar 4, 2014 at 12:09 PM, Deepak Jagtap <[email protected]> wrote:
>>> > > Hi German,
>>> > >
>>> > > I tried applying the patch for 1805 but the problem still persists.
>>> > > Following are the notification messages logged repeatedly by the node which fails to join the quorum:
>>> > >
>>> > > 2014-03-04 20:00:54,398 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] - Notification time out: 51200
>>> > > 2014-03-04 20:00:54,400 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > > 2014-03-04 20:00:54,401 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3 (n.leader), 0x100003e84 (n.zxid), 0x2 (n.round), FOLLOWING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > > 2014-03-04 20:00:54,403 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@605] - Notification: 3 (n.leader), 0x100003e84 (n.zxid), 0xffffffffffffffff (n.round), LEADING (n.state), 3 (n.sid), 0x2 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > >
>>> > > The patch for 1732 is already included in the trunk.
>>> > >
>>> > > Thanks & Regards,
>>> > > Deepak
>>> > >
>>> > > On Fri, Feb 28, 2014 at 2:58 PM, Deepak Jagtap <[email protected]> wrote:
>>> > >> Hi Flavio, German,
>>> > >>
>>> > >> Since this fix is critical for the zookeeper rolling upgrade, is it ok if I apply this patch to the 3.5.0 trunk?
>>> > >> Is it straightforward to apply this patch to trunk?
>>> > >>
>>> > >> Thanks & Regards,
>>> > >> Deepak
>>> > >>
>>> > >> On Wed, Feb 26, 2014 at 11:46 AM, Deepak Jagtap <[email protected]> wrote:
>>> > >>> Thanks German!
>>> > >>> Just wondering, is there any chance that this patch may be applied to trunk in the near future?
>>> > >>> If it's fine with you guys, I would be more than happy to apply the fixes (from 3.4.5) to trunk and test them.
>>> > >>>
>>> > >>> Thanks & Regards,
>>> > >>> Deepak
>>> > >>>
>>> > >>> On Wed, Feb 26, 2014 at 1:29 AM, German Blanco <[email protected]> wrote:
>>> > >>>> Hello Deepak,
>>> > >>>>
>>> > >>>> due to ZOOKEEPER-1732 and then ZOOKEEPER-1805, there are some cases in which an ensemble can be formed so that it doesn't allow any other zookeeper server to join.
>>> > >>>> This has been fixed in branch 3.4, but it hasn't been fixed in trunk yet.
>>> > >>>> Check if the Notifications sent around contain different values for the vote in the members of the ensemble.
>>> > >>>> If you force a new election (e.g. by killing the leader) I guess everything should work normally, but don't take my word for it.
>>> > >>>> Flavio should know more about this.
>>> > >>>>
>>> > >>>> Cheers,
>>> > >>>>
>>> > >>>> German.
>>> > >>>>
>>> > >>>> On Wed, Feb 26, 2014 at 4:04 AM, Deepak Jagtap <[email protected]> wrote:
>>> > >>>> > Hi,
>>> > >>>> >
>>> > >>>> > I am replacing one of the zookeeper servers in a 3-node quorum.
>>> > >>>> > Initially all zookeeper servers were running version 3.5.0.1515976.
>>> > >>>> > I successfully replaced Node3 with the newer version 3.5.0.1551730.
>>> > >>>> > When I tried to replace Node2 with the same zookeeper version,
>>> > >>>> > I couldn't start the zookeeper server on Node2; it is continuously stuck in a leader election loop, printing the following messages:
>>> > >>>> >
>>> > >>>> > 2014-02-26 02:45:23,709 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@837] - Notification time out: 60000
>>> > >>>> > 2014-02-26 02:45:23,710 [myid:3] - INFO [WorkerSender[myid=3]:QuorumCnxManager@195] - Have smaller server identifier, so dropping the connection: (5, 3)
>>> > >>>> > 2014-02-26 02:45:23,712 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@605] - Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)1 (n.config version)
>>> > >>>> >
>>> > >>>> > The network connections and configuration of the node being upgraded are fine.
>>> > >>>> > The other 2 nodes in the quorum are fine and serving requests.
>>> > >>>> >
>>> > >>>> > Any idea what might be causing this?
>>> > >>>> >
>>> > >>>> > Thanks & Regards,
>>> > >>>> > Deepak
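P.S. Regarding German's earlier suggestion to check whether the ensemble members carry different values for the vote: below is a rough sketch of one way to compare what each server reports about itself. It assumes the four-letter "srvr" command is reachable on each client port; the host and port values are placeholders, not a confirmed setup.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class EnsembleViewCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder client addresses for the three ensemble members.
            String[] servers = {"169.254.44.1:2181", "169.254.44.2:2181", "169.254.44.3:2181"};
            for (String server : servers) {
                String[] hostPort = server.split(":");
                try (Socket socket = new Socket(hostPort[0], Integer.parseInt(hostPort[1]))) {
                    // Send the four-letter "srvr" command and dump the reply.
                    socket.getOutputStream().write("srvr".getBytes(StandardCharsets.UTF_8));
                    socket.getOutputStream().flush();
                    socket.shutdownOutput();
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                    System.out.println("=== " + server + " ===");
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // The "Mode:" and "Zxid:" lines show whether the members agree
                        // on who is leading and how far each of them has gotten.
                        System.out.println(line);
                    }
                } catch (Exception e) {
                    System.out.println("=== " + server + " === unreachable: " + e.getMessage());
                }
            }
        }
    }

Comparing the "Mode:" lines of the two healthy nodes with the stuck node at least shows quickly whether the existing members still report leader/follower while the new node stays in leader election.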
