Patrick, thanks! I'll forward this on to IT and report back to you shortly...
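For anyone applying the workaround quoted below, here is a minimal sketch of how the two conditions Patrick mentions (electionAlg=3 in the config file, and a reference to FastLeaderElection in the logs) could be checked. The function names and the sample strings are assumptions made up for illustration; they are not part of ZooKeeper, and real log lines will look different.

```python
import re


def election_alg(config_text):
    """Return the electionAlg value found in config text, or None if unset."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = re.fullmatch(r"electionAlg\s*[=:]\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return None


def workaround_active(config_text, log_text):
    """True when electionAlg is 3 and the log mentions FastLeaderElection."""
    return election_alg(config_text) == 3 and "FastLeaderElection" in log_text


if __name__ == "__main__":
    # Hypothetical sample inputs, not taken from a real deployment.
    sample_cfg = "tickTime=2000\nelectionAlg=3\n"
    sample_log = "hypothetical log line mentioning FastLeaderElection"
    print(workaround_active(sample_cfg, sample_log))
```

In practice the config and log text would be read from each server's own files, whose locations depend on the installation.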
> -----Original Message-----
> From: Patrick Hunt [mailto:ph...@apache.org]
> Sent: Tuesday, August 04, 2009 3:55 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
>
> Todd, Mahadev and I looked at this and it turns out to be a regression.
> Ironically a patch I created for 3.2 branch to add quorum tests actually
> broke the quorum config -- a default value for a config parameter was
> lost. I'm going to submit a patch asap to get the default back, but for
> the time being you can set:
>
> electionAlg=3
>
> in each of your config files.
>
> You should see reference to FastLeaderElection in your log files if this
> parameter is set correctly.
>
> Sorry for the trouble,
>
> Patrick
>
> Todd Greenwood wrote:
> > Mahadev,
> >
> > I just heard from IT that this build behaves in exactly the same way as
> > previous versions, e.g. we get continuous leader elections that
> > disconnect the followers and then get re-elected, and disconnect...etc.
> >
> > This is from a fresh sync to the 3.2 branch:
> >
> > svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> >
> > CHANGES.TXT show the various fixes included:
> >
> > to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> > Release 3.2.1
> >
> > Backward compatibile changes:
> >
> > BUGFIXES:
> > ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> >
> > ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> >
> > ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> >
> > ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> >
> > ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> > (giri via mahadev)
> >
> > ZOOKEEPER-467. Change log level in BookieHandle (flavio via mahadev)
> >
> > ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> > failure. (chris via mahadev)
> >
> > ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> >
> > ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> > embedded clients (ryan rawson via phunt)
> >
> > ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> >
> > ZOOKEEPER-479. QuorumHierarchical does not count groups correctly
> > (flavio via mahadev)
> >
> > ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> > (Chris Darroch via phunt)
> >
> > ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> > add vote of follower (flavio via mahadev)
> >
> > ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
> > mahadev)
> >
> > What can I do to assist you with this issue?
> >
> > -Todd
> >
> >> -----Original Message-----
> >> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> >> Sent: Tuesday, August 04, 2009 12:43 PM
> >> To: zookeeper-dev@hadoop.apache.org
> >> Subject: Re: Unending Leader Elections in WAN deploy
> >>
> >> Hi todd,
> >> comments in line
> >>
> >>
> >> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>> Mahadev,
> >>>
> >>> Some quick questions:
> >>>
> >>> 1. Version
> >>>
> >>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> >>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> >>> this release 3.2.1?
> >> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
> >> the release.
> >>
> >>> 2. Build targets
> >>>
> >>> The package target fails b/c the create-cppunit-configure target fails
> >>> due to various problems w/ respect to autoconf. Are these dependencies
> >>> documented somewhere ? I'd like to have a fully building system.
> >>>
> >>> create-cppunit-configure:
> >>> [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
> >>> [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
> >>> [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
> >>> [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
> >>> [exec] If this token and others are legitimate, please use m4_pattern_allow.
> >>> [exec] See the Autoconf documentation.
> >>> [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
> >>> [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >>>
> >> You need auto tools to run this. Please read the README for building c
> >> client library at src/c/ for the installation requirements.
> >>> 3. Sync failure:
> >>>
> >>> This is still failing.
> >>>
> >>> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
> >>>
> >> Yes this hasn't been fixed yet!
> >>
> >> Thanks
> >> mahadev
> >>> -Todd
> >>>
> >>>> -----Original Message-----
> >>>> From: Todd Greenwood
> >>>> Sent: Tuesday, August 04, 2009 11:26 AM
> >>>> To: 'zookeeper-u...@hadoop.apache.org'
> >>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>
> >>>> Great news. Thank you Mahadev. I'll report our findings later today.
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> >>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi Todd,
> >>>>> I just committed 480 and 491. You can checkout the 3.2 branch now.
> >>>>> Thanks
> >>>>> mahadev
> >>>>>
> >>>>>
> >>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>> That'd be perfect. Thanks!
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> >>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Hi Todd,
> >>>>>>> Most of the patches that you mention should be in the branch 3.2 by
> >>>>>>> tomm or so. 481, 479 are already in. 480 and 491 should be in by tomm.
> >>>>>>> Would that suffice for you?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> mahadev
> >>>>>>>
> >>>>>>>
> >>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>>>> Another problem...I've reverted to the latest versions of the patches
> >>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
> >>>>>>>> errors:
> >>>>>>>>
> >>>>>>>> build-generated:
> >>>>>>>> [javac] Compiling 44 source files to
> >>>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>
> >>>>>>>> compile-main:
> >>>>>>>> [javac] Compiling 2 source files to
> >>>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>> [javac]
> >>>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have
> >>>>>>>> the same erasure
> >>>>>>>> [javac] public String[] getQuorumPeers();
> >>>>>>>> [javac] ^
> >>>>>>>> [javac]
> >>>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have
> >>>>>>>> the same erasure
> >>>>>>>> [javac] public String getServerState();
> >>>>>>>> [javac] ^
> >>>>>>>> [javac] 2 errors
> >>>>>>>>
> >>>>>>>> My build process is pretty simple:
> >>>>>>>>
> >>>>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>>> (src/patched/branch-3.2)
> >>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>>>> 3. build zookeeper in the temp directory
> >>>>>>>>
> >>>>>>>> -Todd
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Todd Greenwood [mailto:to...@audiencescience.com]
> >>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>
> >>>>>>>>> Flavio,
> >>>>>>>>> I notice that you've updated the patches referenced for the WAN
> >>>>>>>>> deployment. There appears to be an order dependency w/ respect to
> >>>>>>>>> these four patches...
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>> ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch
> >>>>>>>>>
> >>>>>>>>> 473 -> 479 (479 fails)
> >>>>>>>>>
> >>>>>>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>>>> Hunk #1 FAILED at 93.
> >>>>>>>>> Hunk #2 FAILED at 145.
> >>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>>> to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>>>>>>
> >>>>>>>>> Could you advise as to which patches I need to apply, and in what
> >>>>>>>>> order?
> >>>>>>>>> -Todd
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
> >>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>>>
> >>>>>>>>>> -Flavio
> >>>>>>>>>>
> >>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
> >>>>>>>>>>> 473, 479, 481, and 491.
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the
> >>>>>>>>>>>> patch.
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Flavio,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> compile-main:
> >>>>>>>>>>>>> [javac] Compiling 1 source file to
> >>>>>>>>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>>> [javac]
> >>>>>>>>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>>>>>>>> [javac] symbol : method getWeight(long)
> >>>>>>>>>>>>> [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>>> [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>> [javac] ^
> >>>>>>>>>>>>> [javac] 1 error
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I see a reference to getWeight in both FastLeaderElection.java in
> >>>>>>>>>>>>> patch 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479,
> >>>>>>>>>>>>> or 481. I also don't see a reference to this method in the trunk...
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>> From: Todd Greenwood [mailto:to...@audiencescience.com]
> >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>> From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
> >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>>>> To: zookeeper-u...@hadoop.apache.org
> >>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>>>>>>>>>>>> 481).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> >>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be
> >>>>>>>>>>>>>> and then disconnects everyone. Then they re-elect it again, and it
> >>>>>>>>>>>>>> loops over and over.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -------------
> >>>>>>>>>>>>>> Server config
> >>>>>>>>>>>>>> -------------
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> >>>>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines
> >>>>>>>>>>>>>> in the pods all have a weight of zero, and are not expected to become
> >>>>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Todd
> >
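
To make the intent of the group/weight settings in the original report concrete, here is a rough sketch that parses those lines, lists the servers with non-zero weight, and applies a simplified hierarchical-quorum rule. This is not ZooKeeper's code. The function names are made up for illustration, and the rule used here (a majority of weight within a majority of the groups whose total weight is non-zero, with unlisted weights defaulting to 1) is an assumption about how the feature is meant to behave.

```python
import re

# Group/weight lines copied from the server config quoted in this thread.
CONFIG = """\
group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1
group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0
"""


def parse_quorum_config(text):
    """Return (groups, weights) parsed from group.N / weight.N lines."""
    groups, weights = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Java properties accept '=' or ':' as the key/value separator,
        # so "group.1:1:2:3:4:5" is read as key "group.1".
        parts = re.split(r"[=:]", line, maxsplit=1)
        if len(parts) != 2:
            continue
        kind, _, ident = parts[0].strip().partition(".")
        if kind == "group":
            groups[int(ident)] = [int(s) for s in parts[1].split(":")]
        elif kind == "weight":
            weights[int(ident)] = int(parts[1])
    return groups, weights


def electable_servers(weights):
    """Servers with non-zero weight, i.e. those intended to lead or vote."""
    return sorted(sid for sid, w in weights.items() if w != 0)


def has_quorum(acks, groups, weights):
    """Simplified rule (assumption): majority of weight in a majority of
    the groups that have non-zero total weight."""
    counted = {}
    for gid, members in groups.items():
        total = sum(weights.get(s, 1) for s in members)
        if total > 0:  # zero-weight groups are left out of the count
            counted[gid] = (members, total)
    won = 0
    for members, total in counted.values():
        acked = sum(weights.get(s, 1) for s in members if s in acks)
        if acked > total / 2:
            won += 1
    return won > len(counted) / 2


if __name__ == "__main__":
    groups, weights = parse_quorum_config(CONFIG)
    print(electable_servers(weights))
    print(has_quorum({1, 2, 3}, groups, weights))
    print(has_quorum({6, 7, 8, 9}, groups, weights))
```

Under these assumptions only servers 1 through 5 are electable, which matches the stated intent that the pod machines (including server.9, pd4-zook02) should neither lead nor vote.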