Mahadev/Flavio -- it looks like 0 weight is still busted. FLEZeroWeightTest is actually failing on my machine, but it's reported as a success:
------------- Standard Error -----------------
Exception in thread "Thread-108" junit.framework.AssertionFailedError: Elected zero-weight server
        at junit.framework.Assert.fail(Assert.java:47)
        at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
------------- ---------------- ---------------

This is probably because the test is calling assert in a thread other than the main test thread, which JUnit will not track or know about.
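
A minimal sketch of one way to surface worker-thread failures on the main test thread. The class and method names (WorkerFailureCapture, runAll) are hypothetical and are not part of FLEZeroWeightTest or JUnit. Each worker records its first failure and the calling thread rethrows it after joining.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of ZooKeeper or JUnit.
class WorkerFailureCapture {
    /**
     * Runs each task on its own thread, waits for all of them, and
     * rethrows the first recorded failure on the calling thread.
     */
    static void runAll(Runnable... tasks) {
        AtomicReference<Throwable> firstFailure = new AtomicReference<>();
        List<Thread> threads = new ArrayList<>();
        for (Runnable task : tasks) {
            Thread t = new Thread(() -> {
                try {
                    task.run();
                } catch (Throwable e) {
                    // Keep only the first failure; later ones are dropped.
                    firstFailure.compareAndSet(null, e);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        Throwable failure = firstFailure.get();
        if (failure instanceof Error) {
            throw (Error) failure; // includes assertion failures
        }
        if (failure instanceof RuntimeException) {
            throw (RuntimeException) failure;
        }
        if (failure != null) {
            throw new RuntimeException(failure);
        }
    }
}
```

Calling something like WorkerFailureCapture.runAll(...) from the test method means an assertion raised inside a worker fails the test instead of only showing up on standard error.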

One problem I see with these tests (the 0 weight test is the one I looked at): they don't have a client attempt to connect to the various servers as part of declaring success. Really we should only consider a test successful (i.e. assert that) if a client can connect to each server in the cluster and make/see changes. As part of fixing this we really need to do a sanity check by testing the various command lines and checking that a client can connect.

I'm not even sure FLENewEpochTest/FLETest/etc. are passing either. The new epoch test seems to just thrash...

Also I tried 3 & 5 server quorums "by hand from the command line" with 0 weight and they see similar issues to what Todd is seeing.

I'm using the latest code in mainline btw.

Patrick

Mahadev Konar wrote:
Hi todd,
I see a lot of
java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
        at java.lang.Thread.run(Thread.java:619)


Is it possible that there is a firewall in the way? Can all the servers 1-9 connect
to all the others using the ports that you specified in zoo.cfg, i.e. 2888/3888?
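
A rough sketch of how that reachability check could be scripted, assuming plain TCP connects are enough to detect a firewall. PortCheck and its methods are hypothetical names. As far as I know only the current leader listens on the first port (2888), so a refusal there from a non-leader is not necessarily a firewall problem; the election port (3888) is the more telling one.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ZooKeeper.
class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts and unresolvable hosts.
            return false;
        }
    }

    /** Lists every "host:port" pair that could not be reached. */
    static List<String> unreachable(List<String> hosts, int[] ports, int timeoutMs) {
        List<String> failures = new ArrayList<>();
        for (String host : hosts) {
            for (int port : ports) {
                if (!canConnect(host, port, timeoutMs)) {
                    failures.add(host + ":" + port);
                }
            }
        }
        return failures;
    }
}
```

Run from each of the nine machines in turn, PortCheck.unreachable(hosts, new int[] {2888, 3888}, 3000) with the host names from zoo.cfg would show which pairs cannot talk to each other.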


Thanks
mahadev


On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

Looks like we're not getting *any* leader elected now.... Logs attached.

-----Original Message-----
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Tuesday, August 04, 2009 4:07 PM
To: zookeeper-dev@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy

Patrick, thanks! I'll forward on to IT and I'll report back to you
shortly...

-----Original Message-----
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Tuesday, August 04, 2009 3:55 PM
To: zookeeper-dev@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

Todd, Mahadev and I looked at this and it turns out to be a regression.
Ironically a patch I created for 3.2 branch to add quorum tests actually
broke the quorum config -- a default value for a config parameter was
lost. I'm going to submit a patch asap to get the default back, but for
the time being you can set:

electionAlg=3

in each of your config files.

You should see references to FastLeaderElection in your log files if
this parameter is set correctly.

Sorry for the trouble,

Patrick

Todd Greenwood wrote:
Mahadev,

I just heard from IT that this build behaves in exactly the same way as
previous versions, e.g. we get continuous leader elections that
disconnect the followers and then get re-elected, and disconnect...etc.
This is from a fresh sync to the 3.2 branch:

svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2

CHANGES.txt shows the various fixes included:

to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
Release 3.2.1

Backward compatibile changes:

BUGFIXES:
  ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)

  ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)

  ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)

  ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)

  ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
  (giri via mahadev)

  ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)

  ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
  failure. (chris via mahadev)

  ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)

  ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
  embedded clients (ryan rawson via phunt)

  ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)

  ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
  (flavio via mahadev)

  ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
  (Chris Darroch via phunt)

  ZOOKEEPER-480. FLE should perform leader check when node is not leading and
  add vote of follower (flavio via mahadev)

  ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
  mahadev)

What can I do to assist you with this issue?

-Todd

-----Original Message-----
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Tuesday, August 04, 2009 12:43 PM
To: zookeeper-dev@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

Hi todd,
 comments in line


On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
Mahadev,

Some quick questions:

1. Version

I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
calling this 3.2.0. Should this be rev'd, and am I correct in calling
this release 3.2.1?

Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
the release.

2. Build targets

The package target fails b/c the create-cppunit-configure target fails
due to various problems w/ respect to autoconf. Are these dependencies
documented somewhere? I'd like to have a fully building system.

create-cppunit-configure:
     [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
     [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
     [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
     [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
     [exec]       If this token and others are legitimate, please use m4_pattern_allow.
     [exec]       See the Autoconf documentation.
     [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
     [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1

You need the autotools to run this. Please read the README for building the C
client library at src/c/ for the installation requirements.

3. Sync failure:

This is still failing.

svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist

Yes this hasn't been fixed yet!

Thanks
mahadev
-Todd

-----Original Message-----
From: Todd Greenwood
Sent: Tuesday, August 04, 2009 11:26 AM
To: 'zookeeper-u...@hadoop.apache.org'
Subject: RE: Unending Leader Elections in WAN deploy

Great news. Thank you Mahadev. I'll report our findings later today.
-Todd

-----Original Message-----
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Tuesday, August 04, 2009 11:20 AM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

Hi Todd,
 I just committed 480 and 491. You can check out the 3.2 branch now.
Thanks
mahadev


On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
That'd be perfect. Thanks!

-----Original Message-----
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Monday, August 03, 2009 4:24 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

Hi Todd,
  Most of the patches that you mention should be in branch 3.2 by tomorrow
or so. 481 and 479 are already in. 480 and 491 should be in by tomorrow. Would
that suffice for you?

Thanks
mahadev


On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
Another problem... I've reverted to the latest versions of the patches
that are not specific to branch-3.2, and I'm getting two compilation
errors:

build-generated:
    [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes

compile-main:
    [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
    [javac]         public String[] getQuorumPeers();
    [javac]                         ^
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
    [javac]         public String getServerState();
    [javac]                       ^
    [javac] 2 errors

My build process is pretty simple:

1. copy the branch-3.2 source to a temp directory
(src/patched/branch-3.2)
2. apply the ZOOKEEPER patches in my patches directory
3. build zookeeper in the temp directory

-Todd
-----Original Message-----
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Monday, August 03, 2009 4:09 PM
To: zookeeper-u...@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy

Flavio,
I notice that you've updated the patches referenced for the WAN
deployment. There appears to be an order dependency w/ respect to these
four patches...

ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch

473 -> 479 (479 fails)


to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
Hunk #1 FAILED at 93.
Hunk #2 FAILED at 145.
2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/

Could you advise as to which patches I need to apply, and in what order?
-Todd

-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 9:51 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

Perfect! Thanks for the update, Todd.

-Flavio

On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:

Thanks. You were right, I had a stale version of 479. Compilation
succeeds and all tests pass on branch-3.2 with the latest patches 473,
479, 481, and 491.

-Todd

-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 7:48 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

It should be in 479. Perhaps you have a stale version of the patch.
-Flavio

On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:

Flavio,

I'm getting a compilation error for patch 491:

compile-main:
  [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
  [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
  [javac] symbol  : method getWeight(long)
  [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
  [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
  [javac] ^
  [javac] 1 error

I see a reference to getWeight both in patch 491 and in the patched
FastLeaderElection.java:

patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)

However, I don't see a reference to this method in patches 473, 479, or
481. I also don't see a reference to this method in the trunk...
-Todd

-----Original Message-----
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Friday, July 31, 2009 7:30 PM
To: zookeeper-u...@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy

Ok, I'll apply that patch and report back.
-Todd

-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 7:18 PM
To: zookeeper-u...@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

You're missing 491 from your set of patches.

-Flavio

On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:

This repros in both branch-3.2 and branch-3.2 + patches (473, 479, 481).

Basically, it seems like the nodes are electing pd4-zook02 to be the
leader. However, pd4-zook02 seems to realize it's not supposed to be and
then disconnects everyone. Then they re-elect it again, and it loops
over and over.

-------------
Server config
-------------

server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888

group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1

group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0

Note that we have 2 groups, composed of machines in 3 different
locations (dc1, pd1, and pd4). The idea is that only machines in dc1
have voting rights, and the ability to become a leader. The machines in
the pods all have a weight of zero, and are not expected to become
leaders, or to vote on transactions.
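
To make the intent of the config above concrete, here is a small sketch of how such a weighted, grouped quorum could be evaluated. It assumes the rule is "a majority of the weight in a majority of the groups that have non-zero total weight", with a default weight of 1 for servers that have no explicit weight. That is my reading of the hierarchical quorum rule, not the actual QuorumHierarchical code, and WeightedQuorumSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not ZooKeeper's QuorumHierarchical.
class WeightedQuorumSketch {
    /** Servers 1-5 in group 1, servers 6-9 in group 2, as in the config above. */
    static Map<Long, Long> exampleGroups() {
        Map<Long, Long> groupOf = new HashMap<>();
        for (long sid = 1; sid <= 5; sid++) groupOf.put(sid, 1L);
        for (long sid = 6; sid <= 9; sid++) groupOf.put(sid, 2L);
        return groupOf;
    }

    /** Weight 1 for servers 1-5, weight 0 for servers 6-9. */
    static Map<Long, Long> exampleWeights() {
        Map<Long, Long> weightOf = new HashMap<>();
        for (long sid = 1; sid <= 5; sid++) weightOf.put(sid, 1L);
        for (long sid = 6; sid <= 9; sid++) weightOf.put(sid, 0L);
        return weightOf;
    }

    /** True if acks holds a weight majority in a majority of non-zero-weight groups. */
    static boolean containsQuorum(Map<Long, Long> groupOf,
                                  Map<Long, Long> weightOf,
                                  Set<Long> acks) {
        Map<Long, Long> totalWeight = new HashMap<>();
        Map<Long, Long> ackedWeight = new HashMap<>();
        for (Map.Entry<Long, Long> e : groupOf.entrySet()) {
            long sid = e.getKey();
            long group = e.getValue();
            long weight = weightOf.getOrDefault(sid, 1L); // assumed default
            totalWeight.merge(group, weight, Long::sum);
            if (acks.contains(sid)) {
                ackedWeight.merge(group, weight, Long::sum);
            }
        }
        int votingGroups = 0;
        int satisfiedGroups = 0;
        for (Map.Entry<Long, Long> e : totalWeight.entrySet()) {
            long total = e.getValue();
            if (total == 0) continue; // zero-weight groups do not count
            votingGroups++;
            long acked = ackedWeight.getOrDefault(e.getKey(), 0L);
            if (2 * acked > total) satisfiedGroups++;
        }
        return 2 * satisfiedGroups > votingGroups;
    }
}
```

Under these assumptions, any three of the dc1 servers form a quorum on their own, while all four pod servers together do not, which matches the behavior described above.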

Let me know what I can do to help resolve this issue.

-Todd
