Looks like we're not getting *any* leader elected now.... Logs attached.
-----Original Message-----
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Tuesday, August 04, 2009 4:07 PM
To: zookeeper-dev@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy
Patrick, thanks! I'll forward on to IT and I'll report back to you
shortly...
-----Original Message-----
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Tuesday, August 04, 2009 3:55 PM
To: zookeeper-dev@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
Todd, Mahadev and I looked at this and it turns out to be a regression.
Ironically a patch I created for 3.2 branch to add quorum tests actually
broke the quorum config -- a default value for a config parameter was
lost. I'm going to submit a patch asap to get the default back, but for
the time being you can set:
electionAlg=3
in each of your config files.
You should see reference to FastLeaderElection in your log files if this
parameter is set correctly.
Sorry for the trouble,
Patrick
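For anyone applying this workaround on several servers, here is a minimal sketch of a check that a config file's text sets electionAlg=3. The function names and the sample fragment are hypothetical, and the parsing is a simplified subset of Java-properties syntax (no escapes, no line continuations), so treat it as an illustration rather than how ZooKeeper itself reads the file.

```python
import re

def parse_properties(text):
    """Parse a simplified subset of Java-properties syntax into a dict.

    Assumption: no escapes or line continuations. The key ends at the
    first '=' or ':' (both are separators in Java properties files).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        parts = re.split(r"\s*[=:]\s*", line, maxsplit=1)
        if len(parts) == 2:
            props[parts[0]] = parts[1]
    return props

def uses_fast_leader_election(text):
    """True if the config text explicitly sets electionAlg=3."""
    return parse_properties(text).get("electionAlg") == "3"

# Hypothetical config fragment; the server line is taken from this thread.
sample = """
server.1=dc1-zook01.dc01.revsci.net:2888:3888
electionAlg=3
"""
print(uses_fast_leader_election(sample))
```

A check like this only confirms the setting is present in the file; the log reference to FastLeaderElection mentioned above is still the way to confirm the server picked it up.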
Todd Greenwood wrote:
Mahadev,
I just heard from IT that this build behaves in exactly the same way as
previous versions, i.e. we get continuous leader elections in which the
leader disconnects the followers, gets re-elected, disconnects them
again, etc.
This is from a fresh sync to the 3.2 branch:
svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
CHANGES.txt shows the various fixes included:
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
Release 3.2.1
Backward compatibile changes:
BUGFIXES:
ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
(giri via mahadev)
ZOOKEEPER-467. Change log level in BookieHandle (flavio via mahadev)
ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
failure. (chris via mahadev)
ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
embedded clients (ryan rawson via phunt)
ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
ZOOKEEPER-479. QuorumHierarchical does not count groups correctly
(flavio via mahadev)
ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
(Chris Darroch via phunt)
ZOOKEEPER-480. FLE should perform leader check when node is not leading and
add vote of follower (flavio via mahadev)
ZOOKEEPER-491. Prevent zero-weight servers from being elected
(flavio via mahadev)
What can I do to assist you with this issue?
-Todd
-----Original Message-----
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Tuesday, August 04, 2009 12:43 PM
To: zookeeper-dev@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
Hi todd,
comments in line
On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
Mahadev,
Some quick questions:
1. Version
I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
calling this 3.2.0. Should this be rev'd, and am I correct in calling
this release 3.2.1?
Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
the release.
2. Build targets
The package target fails b/c the create-cppunit-configure target fails
due to various problems w/ respect to autoconf. Are these dependencies
documented somewhere? I'd like to have a fully building system.
create-cppunit-configure:
[exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
[exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
[exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
[exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
[exec] If this token and others are legitimate, please use m4_pattern_allow.
[exec] See the Autoconf documentation.
[exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
[exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
You need the autotools to run this. Please read the README for building
the C client library at src/c/ for the installation requirements.
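As a rough pre-flight check for the failure above, the sketch below looks for the autotools commands on PATH before running the package target. The tool list is an assumption inferred from the error output (libtoolize missing, libtool macro undefined); the README in src/c/ remains the authoritative list. The AM_PATH_CPPUNIT warning also suggests the cppunit development files are not installed, which a PATH check like this cannot detect.

```python
import shutil

# Assumption: inferred from the autoreconf errors in this thread, not
# copied from the project's README.
REQUIRED_TOOLS = ["autoconf", "automake", "autoreconf", "libtoolize"]

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]

missing = missing_tools()
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all listed autotools commands found on PATH")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is `shutil.which` from the standard library.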
3. Sync failure:
This is still failing.
svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
Yes this hasn't been fixed yet!
Thanks
mahadev
-Todd
-----Original Message-----
From: Todd Greenwood
Sent: Tuesday, August 04, 2009 11:26 AM
To: 'zookeeper-u...@hadoop.apache.org'
Subject: RE: Unending Leader Elections in WAN deploy
Great news. Thank you Mahadev. I'll report our findings later today.
-Todd
-----Original Message-----
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Tuesday, August 04, 2009 11:20 AM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
Hi Todd,
I just committed 480 and 491. You can check out the 3.2 branch now.
Thanks
mahadev
On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
That'd be perfect. Thanks!
-----Original Message-----
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Monday, August 03, 2009 4:24 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
Hi Todd,
Most of the patches that you mention should be in the 3.2 branch by
tomorrow or so. 481 and 479 are already in. 480 and 491 should be in by
tomorrow. Would that suffice for you?
Thanks
mahadev
On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
Another problem...I've reverted to the latest versions of the patches
that are not specific to branch-3.2, and I'm getting two compilation
errors:
build-generated:
[javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
compile-main:
[javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
[javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
[javac] public String[] getQuorumPeers();
[javac] ^
[javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
[javac] public String getServerState();
[javac] ^
[javac] 2 errors
My build process is pretty simple:
1. copy the branch-3.2 source to a temp directory
(src/patched/branch-3.2)
2. apply the ZOOKEEPER patches in my patches directory
3. build zookeeper in the temp directory
-Todd
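The copy / patch / build steps above can be sketched as a small helper that applies patches one at a time and stops at the first failure, which helps when the order matters. This is an illustration only: `apply_patches` and the `run` parameter are hypothetical names, the patch list is simply the set named in this thread in the order it lists them (not a confirmed-correct order), and the copy and build steps are left out. It relies on `patch --dry-run` to test each patch before the tree is modified.

```python
import subprocess

def apply_patches(patches, workdir, run=None):
    """Apply patches in the given order; return the first one that fails, or None.

    Each patch is first checked with `patch --dry-run`, so a failing hunk
    is reported before the source tree is touched.
    """
    if run is None:
        # Default runner: execute the command in workdir, return exit status.
        run = lambda argv: subprocess.run(argv, cwd=workdir).returncode
    for p in patches:
        if run(["patch", "-p0", "--dry-run", "-i", p]) != 0:
            return p  # stop before modifying anything for this patch
        if run(["patch", "-p0", "-i", p]) != 0:
            return p
    return None

# Demo with a stand-in runner that only prints the commands it would run.
def show(argv):
    print(" ".join(argv))
    return 0

failed = apply_patches(
    ["../patches/ZOOKEEPER-473.patch",
     "../patches/ZOOKEEPER-479-branch3.2.patch",
     "../patches/ZOOKEEPER-481-branch3.2.patch",
     "../patches/ZOOKEEPER-491.patch"],
    workdir="src/patched/branch-3.2",
    run=show,
)
print("first failing patch:", failed)
```

Dropping the `run=show` argument makes the helper invoke `patch` for real inside `workdir`, with the patch paths resolved relative to that directory.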
-----Original Message-----
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Monday, August 03, 2009 4:09 PM
To: zookeeper-u...@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy
Flavio,
I notice that you've updated the patches referenced for the WAN
deployment. There appears to be an order dependency w/ respect to these
four patches...
ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch
ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch
473 -> 479 (479 fails)
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
Hunk #1 FAILED at 93.
Hunk #2 FAILED at 145.
2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
Could you advise as to which patches I need to apply, and in what order?
-Todd
-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 9:51 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
Perfect! Thanks for the update, Todd.
-Flavio
On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
Thanks. You were right, I had a stale version of 479. Compilation
succeeds and all tests pass on branch-3.2 with the latest patches 473,
479, 481, and 491.
-Todd
-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 7:48 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
It should be in 479. Perhaps you have a stale version of the patch.
-Flavio
On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
Flavio,
I'm getting a compilation error for patch 491:
compile-main:
[javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
[javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
[javac] symbol : method getWeight(long)
[javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
[javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
[javac] ^
[javac] 1 error
I see a reference to getWeight in both patch 491 and the patched
FastLeaderElection.java:
patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
However, I don't see a reference to this method in patches 473, 479, or
481. I also don't see a reference to this method in the trunk...
-Todd
-----Original Message-----
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Friday, July 31, 2009 7:30 PM
To: zookeeper-u...@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy
Ok, I'll apply that patch and report back.
-Todd
-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 7:18 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy
You're missing 491 from your set of patches.
-Flavio
On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481).
Basically, it seems like the nodes are electing pd4-zook02 to be the
leader. However, pd4-zook02 seems to realize it's not supposed to be and
then disconnects everyone. Then they re-elect it again, and it loops
over and over.
-------------
Server config
-------------
server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888
group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1
group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0
Note that we have 2 groups, composed of machines in 3 different
locations (dc1, pd1, and pd4). The idea is that only machines in dc1
have voting rights, and the ability to become a leader. The machines in
the pods all have a weight of zero, and are not expected to become
leaders, or to vote on transactions.
Let me know what I can do to help resolve this issue.
-Todd
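To make the intent of the group and weight lines above concrete, here is a minimal sketch that reads them and lists the server ids with non-zero weight, i.e. the ones expected to vote and be eligible for leadership. The function names are hypothetical and the parsing is a simplified take on Java-properties syntax, where the key ends at the first '=' or ':' (which is why `group.1:1:2:3:4:5` reads as key `group.1` with value `1:2:3:4:5`). It is not ZooKeeper's own parsing or election logic.

```python
import re

def parse_quorum_config(text):
    """Collect group membership and per-server weights from config text.

    Assumption: simplified Java-properties parsing with no escapes or
    continuations; lines that are not key/value pairs are ignored.
    """
    groups, weights = {}, {}
    for raw in text.splitlines():
        parts = re.split(r"\s*[=:]\s*", raw.strip(), maxsplit=1)
        if len(parts) != 2:
            continue
        key, value = parts
        if key.startswith("group."):
            groups[int(key[len("group."):])] = [int(s) for s in value.split(":")]
        elif key.startswith("weight."):
            weights[int(key[len("weight."):])] = int(value)
    return groups, weights

def nonzero_weight_servers(text):
    """Server ids that belong to a group and have a non-zero weight."""
    groups, weights = parse_quorum_config(text)
    # Assumption: a server with no explicit weight line counts as weight 1.
    return sorted(sid for members in groups.values() for sid in members
                  if weights.get(sid, 1) != 0)

# Group and weight lines as posted in this thread.
config = """
group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1
group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0
"""
print(nonzero_weight_servers(config))  # prints [1, 2, 3, 4, 5]
```

With this config only servers 1 through 5 (the dc1 machines) carry weight, so a leader with id 9 (pd4-zook02) is exactly what the configuration is meant to rule out.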