Mahadev, I just heard from IT that this build behaves exactly the same way as previous versions, i.e. we get continuous leader elections: the elected leader disconnects the followers, gets re-elected, disconnects them again, and so on.
This is from a fresh sync to the 3.2 branch:

svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2

CHANGES.txt shows the various fixes included:

to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
Release 3.2.1

Backward compatibile changes:

BUGFIXES:
  ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
  ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
  ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
  ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
  ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure) (giri via mahadev)
  ZOOKEEPER-467. Change log level in BookieHandle (flavio via mahadev)
  ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate failure. (chris via mahadev)
  ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
  ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other) embedded clients (ryan rawson via phunt)
  ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
  ZOOKEEPER-479. QuorumHierarchical does not count groups correctly (flavio via mahadev)
  ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert (Chris Darroch via phunt)
  ZOOKEEPER-480. FLE should perform leader check when node is not leading and add vote of follower (flavio via mahadev)
  ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via mahadev)

What can I do to assist you with this issue?

-Todd

> -----Original Message-----
> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> Sent: Tuesday, August 04, 2009 12:43 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
>
> Hi todd,
> comments in line
>
>
> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>
> > Mahadev,
> >
> > Some quick questions:
> >
> > 1. Version
> >
> > I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> > calling this 3.2.0. Should this be rev'd, and am I correct in calling
> > this release 3.2.1?
> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
> release.
>
> >
> > 2. Build targets
> >
> > The package target fails b/c the create-cppunit-configure target fails
> > due to various problems w/ respect to autoconf. Are these dependencies
> > documented somewhere? I'd like to have a fully building system.
> >
> > create-cppunit-configure:
> >     [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
> >     [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
> >     [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
> >     [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
> >     [exec] If this token and others are legitimate, please use m4_pattern_allow.
> >     [exec] See the Autoconf documentation.
> >     [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
> >     [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >
> You need auto tools to run this. Please read the README for building c
> client library at src/c/ for the installation requirements.
>
> >
> > 3. Sync failure:
> >
> > This is still failing.
> >
> > svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
> >
>
> Yes this hasn't been fixed yet!
>
> Thanks
> mahadev
> > -Todd
> >
> >> -----Original Message-----
> >> From: Todd Greenwood
> >> Sent: Tuesday, August 04, 2009 11:26 AM
> >> To: 'zookeeper-u...@hadoop.apache.org'
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Great news. Thank you Mahadev. I'll report our findings later today.
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> >>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>> To: zookeeper-u...@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Hi Todd,
> >>> I just committed 480 and 491. You can checkout the 3.2 branch now.
> >>>
> >>> Thanks
> >>> mahadev
> >>>
> >>>
> >>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>
> >>>> That'd be perfect. Thanks!
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> >>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi Todd,
> >>>>> Most of the patches that you mention should be in the branch 3.2 by tomm
> >>>>> or so. 481, 479 are already in. 480 and 491 should be in by tomm. Would
> >>>>> that suffice for you?
> >>>>>
> >>>>> Thanks
> >>>>> mahadev
> >>>>>
> >>>>>
> >>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>
> >>>>>> Another problem...I've reverted to the latest versions of the patches
> >>>>>> that are not specific to branch-3.2, and I'm getting two compilation
> >>>>>> errors:
> >>>>>>
> >>>>>> build-generated:
> >>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>
> >>>>>> compile-main:
> >>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >>>>>>     [javac]     public String[] getQuorumPeers();
> >>>>>>     [javac]                     ^
> >>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >>>>>>     [javac]     public String getServerState();
> >>>>>>     [javac]                   ^
> >>>>>>     [javac] 2 errors
> >>>>>>
> >>>>>> My build process is pretty simple:
> >>>>>>
> >>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>    (src/patched/branch-3.2)
> >>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>> 3. build zookeeper in the temp directory
> >>>>>>
> >>>>>> -Todd
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood [mailto:to...@audiencescience.com]
> >>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Flavio,
> >>>>>>> I notice that you've updated the patches referenced for the WAN
> >>>>>>> deployment. There appears to be an order dependency w/ respect to these
> >>>>>>> four patches...
> >>>>>>>
> >>>>>>> ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch
> >>>>>>> ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch
> >>>>>>>
> >>>>>>> 473 -> 479 (479 fails)
> >>>>>>>
> >>>>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>> Hunk #1 FAILED at 93.
> >>>>>>> Hunk #2 FAILED at 145.
> >>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>>>>
> >>>>>>> Could you advise as to which patches I need to apply, and in what order?
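
[Editor's sketch] The three-step build process and the patch-order question quoted above could be scripted roughly as follows. This is a minimal sketch, not anything used in the thread: the function names `patch_commands` and `rebuild_patched_tree` are hypothetical, the build step assumes that a bare `ant` invocation is enough, and the only `patch` options used are `-p0` (as in the quoted command), `-i` and `--dry-run` from GNU patch. Each patch is dry-run immediately before it is applied, so a patch that depends on an earlier one is tested against the already-patched tree and a failure stops the run before any `.rej` files are written.

```python
import shutil
import subprocess
from pathlib import Path


def patch_commands(patch_files, dry_run=True):
    """Build one `patch -p0` argv per patch file, in the order given."""
    cmds = []
    for p in patch_files:
        argv = ["patch", "-p0", "-i", str(p)]
        if dry_run:
            # --dry-run reports what would happen without touching files.
            argv.insert(1, "--dry-run")
        cmds.append(argv)
    return cmds


def rebuild_patched_tree(src, dest, patch_files):
    """Copy src to dest, apply patches in order, then build (hypothetical helper)."""
    src, dest = Path(src), Path(dest)
    # Step 1: start from a clean copy of the branch.
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(src, dest)
    # Step 2: for each patch, dry-run first, then apply for real.
    patch_files = [Path(p).resolve() for p in patch_files]
    for p in patch_files:
        for argv in patch_commands([p], dry_run=True) + patch_commands([p], dry_run=False):
            subprocess.run(argv, cwd=dest, check=True)  # raises on a failed hunk
    # Step 3: build; assumes the default ant target is the one wanted.
    subprocess.run(["ant"], cwd=dest, check=True)
```

A call such as `rebuild_patched_tree("src/original/branch-3.2", "src/patched/branch-3.2", ["patches/ZOOKEEPER-473.patch", "patches/ZOOKEEPER-479-branch3.2.patch", "patches/ZOOKEEPER-481-branch3.2.patch", "patches/ZOOKEEPER-491.patch"])` would try the patches in the order they are listed in the message; that order is only the one quoted, not a verified one.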
> >>>>>>>
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
> >>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>
> >>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
> >>>>>>>>> 479, 481, and 491.
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
> >>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
> >>>>>>>>>>
> >>>>>>>>>> -Flavio
> >>>>>>>>>>
> >>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Flavio,
> >>>>>>>>>>>
> >>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>
> >>>>>>>>>>> compile-main:
> >>>>>>>>>>>     [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>>>>>>     [javac] symbol : method getWeight(long)
> >>>>>>>>>>>     [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>     [javac]     if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>     [javac]                               ^
> >>>>>>>>>>>     [javac] 1 error
> >>>>>>>>>>>
> >>>>>>>>>>> I see a reference to getWeight in both patch 491 and the patched
> >>>>>>>>>>> FastLeaderElection.java:
> >>>>>>>>>>>
> >>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>
> >>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479,
> >>>>>>>>>>> or 481. I also don't see a reference to this method in the trunk...
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood [mailto:to...@audiencescience.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>>>>>>>>>> 481).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> >>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be
> >>>>>>>>>>>> and then disconnects everyone. Then they re-elect it again, and it
> >>>>>>>>>>>> loops over and over.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>> Server config
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> >>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines
> >>>>>>>>>>>> in the pods all have a weight of zero, and are not expected to
> >>>>>>>>>>>> become leaders, or to vote on transactions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>
> >
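
[Editor's sketch] The intent behind the group/weight config quoted above can be written out as a small model. This is a sketch under stated assumptions, not ZooKeeper's QuorumHierarchical code: it assumes a set of servers is a quorum when it holds more than half the weight in more than half of the groups, that groups whose total weight is zero are not counted at all, and that only servers with non-zero weight may lead (the intent named by ZOOKEEPER-491 in the CHANGES.txt listing). The names `contains_quorum` and `eligible_leaders` are hypothetical; the group and weight tables are copied from the quoted config.

```python
# Group membership and weights, copied from the quoted server config.
GROUPS = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]}
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def contains_quorum(acks, groups=GROUPS, weights=WEIGHTS):
    """Return True if the server ids in `acks` form a hierarchical quorum."""
    acks = set(acks)
    # Assumption: a group whose total weight is zero does not count at all.
    voting = {g: sids for g, sids in groups.items()
              if sum(weights[s] for s in sids) > 0}
    satisfied = 0
    for sids in voting.values():
        total = sum(weights[s] for s in sids)
        acked = sum(weights[s] for s in sids if s in acks)
        if 2 * acked > total:  # strict majority of the group's weight
            satisfied += 1
    return 2 * satisfied > len(voting)  # strict majority of voting groups


def eligible_leaders(weights=WEIGHTS):
    """Servers with non-zero weight, i.e. the only intended leader candidates."""
    return sorted(sid for sid, w in weights.items() if w != 0)
```

Under these assumptions, any three of the five dc1 servers form a quorum on their own, the four pod servers never do, and server 9 (pd4-zook02, the node reported as being elected over and over) is not among the eligible leaders.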