RE: How do we find the Server the client is connected to?

2009-10-01 Thread Todd Greenwood
Failover testing.

 -Original Message-
 From: Patrick Hunt [mailto:ph...@apache.org]
 Sent: Thursday, October 01, 2009 3:44 PM
 To: zookeeper-user@hadoop.apache.org; Rob Baccus
 Subject: Re: How do we find the Server the client is connected to?
 
 That detail is purposefully not exposed through the client API; however,
 it is output to the log on connection establishment.
 
 Why would your client code need to know which server in the ensemble it
 is connected to?
 
 Patrick
 
 Rob Baccus wrote:
  How do I determine the server the client is connected to?  It is not
  exposed as far as I can see in either the ZooKeeper object or the
  ClientCnxn object.  I did find, on line 790 in ClientCnxn.startConnect(),
  the place where the actual server connection happens, but that is
  not exposed.
 
  Rob Baccus
  425-201-3812
 
 


RE: ACL question w/ Zookeeper 3.1.1

2009-09-21 Thread Todd Greenwood
 = false
eventOfDeath = {java.lang.obj...@1392}
lastZxid = 1
xid = 3
response = {org.apache.zookeeper.proto.createrespo...@1365}\n
r = {org.apache.zookeeper.proto.replyhea...@1445}0,0,-112\n
request = {org.apache.zookeeper.proto.createrequ...@1360}'/ACLTest,,v{s{31,s{'auth,'}}},0\n
path = {java.lang.str...@1314}/ACLTest
data = {byte...@1339}
acl = {java.util.arrayl...@1242} size = 1
flags = 0
path = {java.lang.str...@1314}/ACLTest
h = {org.apache.zookeeper.proto.requesthea...@1352}2,1\n
cnxn = {org.apache.zookeeper.clientc...@1381}sessionId:
0x123de5b3b1b\nlastZxid: 1\nxid: 3\nnextAddrToTry: 0\nserverAddrs:
/127.0.0.1:2181\n


--
v5

NOTE: If I use Ids.OPEN_ACL_UNSAFE, then everything works fine. Here's
an example of the debug state after a create()...
--

this = {org.apache.zookeeper.zookee...@1266}
watchManager = {org.apache.zookeeper.zookeeper$zkwatchmana...@1397}
state = {org.apache.zookeeper.zookeeper$sta...@1398}CONNECTED
cnxn = {org.apache.zookeeper.clientc...@1374}sessionId:
0x123de6ba8de\nlastZxid: 2\nxid: 3\nnextAddrToTry: 0\nserverAddrs:
/127.0.0.1:2181\n
serverAddrs = {java.util.arrayl...@1403} size = 1
authInfo = {java.util.arrayl...@1404} size = 1
[0] = {org.apache.zookeeper.clientcnxn$authd...@1415}
scheme = {java.lang.str...@1244}digest
data = {byte[...@1416}
pendingQueue = {java.util.linkedl...@1405} size = 0
outgoingQueue = {java.util.linkedl...@1406} size = 0
nextAddrToTry = 0
connectTimeout = 4
readTimeout = 2
sessionTimeout = 5
zooKeeper = {org.apache.zookeeper.zookee...@1266}
watcher = {org.apache.zookeeper.zookeeper$zkwatchmana...@1397}
sessionId = 82153772198789120
sessionPasswd = {byte[...@1407}
sendThread =
{org.apache.zookeeper.clientcnxn$sendthr...@1259}Thread[main-SendThread
,5,main]
eventThread =
{org.apache.zookeeper.clientcnxn$eventthr...@1265}Thread[main-EventThre
ad,5,main]
selector = {sun.nio.ch.epollselectori...@1408}
closing = false
eventOfDeath = {java.lang.obj...@1409}
lastZxid = 2
xid = 3
response = {org.apache.zookeeper.proto.createrespo...@1360}'/ACLTest\n
r = {org.apache.zookeeper.proto.replyhea...@1389}2,2,0\n
xid = 2
zxid = 2
err = 0
request = {org.apache.zookeeper.proto.createrequ...@1355}'/ACLTest,,v{s{15,s{'world,'anyone}}},0\n
path = {java.lang.str...@1314}/ACLTest
h = {org.apache.zookeeper.proto.requesthea...@1347}2,1\n
cnxn = {org.apache.zookeeper.clientc...@1374}sessionId:
0x123de6ba8de\nlastZxid: 2\nxid: 3\nnextAddrToTry: 0\nserverAddrs:
/127.0.0.1:2181\n

 -Original Message-
 From: Todd Greenwood [mailto:to...@audiencescience.com]
 Sent: Friday, September 18, 2009 11:27 AM
 To: Patrick Hunt; zookeeper-...@hadoop.apache.org; zookeeper-
 u...@hadoop.apache.org
 Subject: RE: ACL question w/ Zookeeper 3.1.1
 
 Patrick / Mahadev,
 
 Thanks for the heads-up!
 
 Apparently I *am* receiving email from zookeeper-user but it is being
 filtered out as spam. This just started happening, but I'll rectify on
 my end.
 
 I'm working through Mahadev's response and will respond shortly (and search
 for other postings, as well). Apologies for the cross-post.
 
 -Todd
 
  -Original Message-
  From: Patrick Hunt [mailto:ph...@apache.org]
  Sent: Friday, September 18, 2009 11:19 AM
  To: zookeeper-...@hadoop.apache.org;
zookeeper-user@hadoop.apache.org
  Cc: Todd Greenwood
  Subject: Re: ACL question w/ Zookeeper 3.1.1
 
  Todd, there were other responses as well. Are you seeing other
traffic
  from the lists? (perhaps a spam filtering issue?)
 
  Patrick
 
  Mahadev Konar wrote:
   Hi Todd,
 We did respond on zookeeper-user. Here is my response in case you didn't see it...
  
  
   Hi Todd,
 From what I understand, you are saying that CREATOR_ALL_ACL does not work
   with auth?
  
 I tried the following with CREATOR_ALL_ACL and it seemed to work for me...
  
   import org.apache.zookeeper.CreateMode;
   import org.apache.zookeeper.WatchedEvent;
   import org.apache.zookeeper.Watcher;
   import org.apache.zookeeper.ZooKeeper;
   import org.apache.zookeeper.data.ACL;
   import org.apache.zookeeper.ZooDefs.Ids;
   import java.util.ArrayList;
   import java.util.List;
  
   public class TestACl implements Watcher {
  
   public static void main(String[] argv) throws Exception {
    List<ACL> acls = new ArrayList<ACL>(1);
    String authentication_type = "digest";
    String authentication = "mahadev:some";
   
    for (ACL ids_acl : Ids.CREATOR_ALL_ACL) {
    acls.add(ids_acl);
    }
    TestACl tacl = new TestACl();
    ZooKeeper zoo = new ZooKeeper("localhost:2181", 3000, tacl);
    zoo.addAuthInfo(authentication_type, authentication.getBytes());
    zoo.create("/some", new byte[0], acls, CreateMode.PERSISTENT);
    zoo.setData("/some", new byte[0], -1);
   }
  
   @Override
   public void process(WatchedEvent event) {
  
  
   }
   }
  
  
   And it worked

RE: ACL question w/ Zookeeper 3.1.1

2009-09-21 Thread Todd Greenwood
Patrick,

In v3/4, I am using Ids.CREATOR_ALL_ACL. In v5, Ids.OPEN_ACL_UNSAFE. In
all cases, ACLs are specified and authentication credentials have been
added to the ZooKeeper instance.

--
CODE
---
// v5
//for ( ACL ids_acl : Ids.CREATOR_ALL_ACL )
//{
//acl.add( ids_acl );
//}

// v3/4
for ( ACL ids_acl : Ids.OPEN_ACL_UNSAFE )
{
acl.add( ids_acl );
}

// all cases (v3,4,5) have authentication credentials set
zoo = new ZooKeeper( connection_string, connectiontimeout, this );
zoo.addAuthInfo( authentication_type, authentication.getBytes() );

// all cases (v3,4,5) use the acl defined above
zoo.create( normPath(path), new byte[0], acl, mode );

I'll investigate further and log a bug if I can isolate this.

-Todd

 -Original Message-
 From: Patrick Hunt [mailto:ph...@apache.org]
 Sent: Monday, September 21, 2009 4:32 PM
 To: zookeeper-user@hadoop.apache.org; Todd Greenwood
 Cc: Patrick Hunt
 Subject: Re: ACL question w/ Zookeeper 3.1.1
 
 Todd Greenwood wrote:
  Patrick,
 
  Thanks, I'll spend some more time trying to create a more concise
repro,
  and log a bug once I do. The only reason I posted this mash was to
see
  if the replyHeader error, 0,0,-112, made sense of the ACL
exception.
 
  The rest is just context...and clearly too much of that :o). I don't
see
  a difference between v3 and v4... The only differences that I can see
  are between v4 and v5 (v4 fails and v5 succeeds):
 
 I did see this diff btw 3/4, 3 has this:
 
 request =
 {org.apache.zookeeper.proto.createrequ...@1360}'/ACLTest,,v{},0\n
 
 you don't have any acl specified for the node create, or is this
 supposed to be a working example w/o auth? (like I said, I'm
confused...)
 
 
  v4:
  response = {org.apache.zookeeper.proto.createrespo...@1365}\n
  r = {org.apache.zookeeper.proto.replyhea...@1445}0,0,-112\n
 
 
 -112 return code is session expired, not auth failure. According to
 this your client's session expired, but w/o more info (code/log or idea
 of what your test is doing) I can't really speculate why you are getting
 this (old client session that was not shut down correctly and finally
 expired while running a different/new test?)
 
 Patrick
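
As a side note, the -112 in that reply header can be mapped to its symbolic name
programmatically. A minimal sketch, assuming a 3.3+ client where KeeperException.Code
is an enum with a static get(int) lookup (in 3.1.x the codes are plain int constants
on KeeperException.Code):

import org.apache.zookeeper.KeeperException;

public class ReplyCodeCheck {
    public static void main(String[] args) {
        int err = -112; // error field from the reply header "0,0,-112"
        KeeperException.Code code = KeeperException.Code.get(err);
        // Prints "-112 => SESSIONEXPIRED", matching Patrick's reading above.
        System.out.println(err + " => " + code);
    }
}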
 
  v5:
  response =
  {org.apache.zookeeper.proto.createrespo...@1360}'/ACLTest\n
  r = {org.apache.zookeeper.proto.replyhea...@1389}2,2,0\n
 
  -Todd
 
  -Original Message-
  From: Patrick Hunt [mailto:ph...@apache.org]
  Sent: Monday, September 21, 2009 4:14 PM
  To: zookeeper-user@hadoop.apache.org; Todd Greenwood
  Subject: Re: ACL question w/ Zookeeper 3.1.1
 
 Todd, I spent some time looking at your output and honestly I'm having
 trouble making sense of what you are saying. What's the diff btw v3 and
 v4? I'm afraid there are too many variables, can you help nail things
 down?
  1) create a jira for this
  https://issues.apache.org/jira/browse/ZOOKEEPER
 
  2) if at all possible attach the code you are running that has
  problems,
  seems like you've boiled it down to a case where it is
deterministic,
  this would be the best for us to debug. If you can't attach the
code
  then include snippets - in particular the addAuthInfo call
  (w/parameter
  details) for your clients, and the individual create calls,
including
  the acl specifics - and describe what your client(s) are doing in
  detail
  so that we can attempt to reproduce.
 
  3) attach a trace level log from both the server and client during
  your
  test run, point out the time index when you see the auth failure.
 
 
  btw, you might try doing a getACL(path...) just before the
operation
  that's failing - it will give you some insight into what the acl is
  set
  to for that node.
 
  Patrick
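
A minimal sketch of the getACL() check Patrick suggests; the helper name and the use
of an already-connected ZooKeeper handle are illustrative assumptions:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Stat;

public class AclDebug {
    // Dump the ACL currently set on a node just before the failing operation.
    public static void dumpAcl(ZooKeeper zoo, String path) throws Exception {
        Stat stat = new Stat();
        List<ACL> acls = zoo.getACL(path, stat);
        for (ACL a : acls) {
            System.out.println(path + " perms=" + a.getPerms()
                    + " scheme=" + a.getId().getScheme()
                    + " id=" + a.getId().getId());
        }
    }
}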
 
  Todd Greenwood wrote:
  Patrick / Mahadev,
 
  I've spent the last couple of days attempting to isolate this
issue,
  and
  this is what I've come up with...
 
  Mahadev's simple use case works fine, as posted. However, my more
  involved use cases are consistently failing w/ InvalidACL
exceptions
  when I use digest authentication with Ids.CREATOR_ALL_ACL:
 
  java.lang.Exception:
  com.audiencescience.util.zookeeper.wrapper.ZooWrapperException:
  org.apache.zookeeper.KeeperException$InvalidACLException:
  KeeperErrorCode = InvalidACL for /ACLTest
 
  Prior to throwing this exception, the response is
  (Zookeeper.java:create()):
  r = {org.apache.zookeeper.proto.replyhea...@1445}0,0,-112\n
  More debug data below.
 
  So, while I can get Mahadev's simple example to work, I cannot get
a
  more involved use case to work correctly. However, if I change my
  code
  to use Ids.OPEN_ACL_UNSAFE, then everything works fine. Example
  debug
  output below at v5.
 
  Could someone point me at non-trivial test cases for ACLs, and
  perhaps
  give me some insight into how to debug this issue further?
 
  -Todd
 
 
  ---
  Code Snippet

ACL question w/ Zookeeper 3.1.1

2009-09-17 Thread Todd Greenwood
I'm attempting to secure a zookeeper installation using zookeeper ACLs.
However, I'm finding that while Ids.OPEN_ACL_UNSAFE works great, my
attempts at using Ids.CREATOR_ALL_ACL are failing. Here's a code
snippet:


public class ZooWrapper
{

/*
1. Here I'm setting up my authentication. I've got an ACL list, and my
authentication strings.
*/
private final List<ACL> acl = new ArrayList<ACL>( 1 );
private static final String authentication_type = "digest";
private static final String authentication = "audiencescience:gravy";


public ZooWrapper( final String connection_string,
   final String path,
   final int connectiontimeout ) throws
ZooWrapperException
{
...
/*
2. Here I'm adding the acls
*/

// This works (creates nodes, sets data on nodes)
for ( ACL ids_acl : Ids.OPEN_ACL_UNSAFE )
{
acl.add( ids_acl);
}

/*
NOTE:  This does not work (nodes are not created, cannot set data on
nodes b/c nodes do not exist)
*/

//for ( ACL ids_acl : Ids.CREATOR_ALL_ACL )
//{
//acl.add( ids_acl );
//}

/*
3. Finally, I create a new zookeeper instance and add my authorization
info to it.
*/
 zoo = new ZooKeeper( connection_string, connectiontimeout, this );
 zoo.addAuthInfo( authentication_type, authentication.getBytes() );

/*
4. Later, I try to write some data into zookeeper by first creating the
node, and then calling setdata...
*/
  zoo.create( path, new byte[0], acl, CreateMode.PERSISTENT );

  zoo.setData( path, bytes, -1 );

As I mentioned above, when I add Ids.OPEN_ACL_UNSAFE to acl, then both
the create and setData succeed. However, when I use Ids.CREATOR_ALL_ACL,
then the nodes are not created. Am I missing something obvious w/
respect to configuring ACLs?
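
For comparison, a hedged sketch (not taken from this thread) that creates the same
node with an explicit digest ACL instead of Ids.CREATOR_ALL_ACL; the use of the
server-side DigestAuthenticationProvider.generateDigest() helper, the connect string,
and the reuse of the same user:password credentials are illustrative assumptions only,
not a statement of what is going wrong above:

import java.util.ArrayList;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Perms;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;

public class ExplicitDigestAcl {
    public static void main(String[] args) throws Exception {
        // Same "user:password" string that is passed to addAuthInfo().
        String auth = "audiencescience:gravy";

        // Fixed digest ACL: grants ALL to this specific user, independent of
        // which authenticated session later creates the node.
        Id id = new Id("digest", DigestAuthenticationProvider.generateDigest(auth));
        List<ACL> acl = new ArrayList<ACL>(1);
        acl.add(new ACL(Perms.ALL, id));

        ZooKeeper zoo = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { }
        });
        zoo.addAuthInfo("digest", auth.getBytes());
        zoo.create("/ACLTest", new byte[0], acl, CreateMode.PERSISTENT);
        zoo.setData("/ACLTest", new byte[0], -1);
        zoo.close();
    }
}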

I've used the following references:

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html

http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-commits/200807
.mbox/%3c20080731201025.c62092388...@eris.apache.org%3e

http://books.google.com/books?id=bKPEwR-Pt6ECpg=PT404lpg=PT404dq=zook
eeper+ACL+digest+%22new+Id%22source=blots=kObz0y8eFksig=VFCAsNW0mBJyZ
swoweJDI31iNlohl=enei=Z82ySojRFsqRlAeqxsyIDwsa=Xoi=book_resultct=re
sultresnum=6#v=onepageq=zookeeper%20ACL%20digest%20%22new%20Id%22f=fa
lse

-Todd


RE: Unending Leader Elections in WAN deploy

2009-08-04 Thread Todd Greenwood
Great news. Thank you Mahadev. I'll report our findings later today.
-Todd

 -Original Message-
 From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
 Sent: Tuesday, August 04, 2009 11:20 AM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Unending Leader Elections in WAN deploy
 
 Hi Todd,
  I just committed 480 and 491. You can checkout the 3.2 branch now.
 
 Thanks
 mahadev
 
 
 On 8/3/09 4:29 PM, Todd Greenwood to...@audiencescience.com wrote:
 
  That'd be perfect. Thanks!
 
  -Original Message-
  From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
  Sent: Monday, August 03, 2009 4:24 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  Hi Todd,
Most of the patches that you mention should be in the branch 3.2
by
  tomm
  or so. 481, 479 are already in. 480 and 491 should be in by tomm.
  Would
  that
  suffice for you?
 
  Thanks
  mahadev
 
 
  On 8/3/09 4:21 PM, Todd Greenwood to...@audiencescience.com
wrote:
 
  Another problem...I've reverted to the latest versions of the
  patches
  that are not specific to branch-3.2, and I'm getting two
compilation
  errors:
 
  build-generated:
  [javac] Compiling 44 source files to
 
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
  atched/branch-3.2/build/classes
 
  compile-main:
  [javac] Compiling 2 source files to
 
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
  atched/branch-3.2/build/classes
  [javac]
 
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
 
 
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
  mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers()
  have
  the same erasure
  [javac] public String[] getQuorumPeers();
  [javac] ^
  [javac]
 
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
 
 
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
  mStats.java:31: name clash: getServerState() and getServerState()
  have
  the same erasure
  [javac] public String getServerState();
  [javac]   ^
  [javac] 2 errors
 
  My build process is pretty simple:
 
  1. copy the branch-3.2 source to a temp directory
  (src/patched/branch-3.2)
  2. apply the ZOOKEEPER patches in my patches directory
  3. build zookeeper in the temp directory
 
  -Todd
  -Original Message-
  From: Todd Greenwood [mailto:to...@audiencescience.com]
  Sent: Monday, August 03, 2009 4:09 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: RE: Unending Leader Elections in WAN deploy
 
  Flavio,
  I notice that you've updated the patches referenced for the WAN
  deployment. There appears to be an order dependency w/ respect to
  these
  four patches...
 
  ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
  ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
 
  473 -> 479 (479 fails)
 
 
 
 
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
  /src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
  patching file
 
 
 
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
  ical.java
  patching file
 
 
 
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
  patching file
 
 
 
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
  .java
  patching file
 
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
  Hunk #1 FAILED at 93.
  Hunk #2 FAILED at 145.
  2 out of 2 hunks FAILED -- saving rejects to file
 
 
 
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
 
 
 
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
  /src/patched/branch-3.2$ h ../patches/
 
  Could you advise as to which patches I need to apply, and in what
  order?
 
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 9:51 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  Perfect! Thanks for the update, Todd.
 
  -Flavio
 
  On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
 
  Thanks. You were right, I had a stale version of 479.
Compilation
  succeeds and all tests pass on branch-3.2 with the latest
patches
  473,
  479, 481, and 491.
 
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 7:48 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  It should be in 479. Perhaps you have a stale version of the
  patch.
 
  -Flavio
 
  On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
 
  Flavio,
 
  I'm getting a compilation error for patch 491:
 
  compile-main:
[javac] Compiling 1 source file to
 
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p

Zookeeper Client

2009-08-04 Thread Todd Greenwood
Is org.apache.zookeeper.ZooKeeper thread safe?

I've started walking through the code to check for mutability, and
although the first level children are protected, I haven't fully walked
the graph. Perhaps I should ask, is it supposed to be thread safe?

-Todd


RE: Unending Leader Elections in WAN deploy

2009-08-03 Thread Todd Greenwood
Flavio,
I notice that you've updated the patches referenced for the WAN
deployment. There appears to be an order dependency w/ respect to these
four patches...

ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch

473 -> 479 (479 fails)

to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
patching file
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
ical.java
patching file
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
patching file
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
.java
patching file
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
Hunk #1 FAILED at 93.
Hunk #2 FAILED at 145.
2 out of 2 hunks FAILED -- saving rejects to file
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
/src/patched/branch-3.2$ h ../patches/

Could you advise as to which patches I need to apply, and in what order?

-Todd

 -Original Message-
 From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
 Sent: Friday, July 31, 2009 9:51 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Unending Leader Elections in WAN deploy
 
 Perfect! Thanks for the update, Todd.
 
 -Flavio
 
 On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
 
  Thanks. You were right, I had a stale version of 479. Compilation
  succeeds and all tests pass on branch-3.2 with the latest patches
473,
  479, 481, and 491.
 
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 7:48 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  It should be in 479. Perhaps you have a stale version of the patch.
 
  -Flavio
 
  On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
 
  Flavio,
 
  I'm getting a compilation error for patch 491:
 
  compile-main:
[javac] Compiling 1 source file to
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p
  atched/branch-3.2/build/classes
[javac]
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p
 
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
  FastL
  eaderElection.java:601: cannot find symbol
[javac] symbol  : method getWeight(long)
[javac] location: interface
  org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
[javac]
  if(self.getQuorumVerifier().getWeight(n.sid) != 0)
[javac]^
[javac] 1 error
 
  I see a reference to getWeight in both FastLeaderElection.java in
  patch
  491:
 
  patches/ZOOKEEPER-491.patch:+
  if(self.getQuorumVerifier().getWeight(n.sid) != 0)
  src/java/main/org/apache/zookeeper/server/quorum/
  FastLeaderElection.java
  :
  if(self.getQuorumVerifier().getWeight(n.sid) !=
  0)
 
  However, I don't see a reference to this method in patches 473,
479,
  or
  481. I also don't see a reference to this method in the trunk...
 
  -Todd
 
  -Original Message-
  From: Todd Greenwood [mailto:to...@audiencescience.com]
  Sent: Friday, July 31, 2009 7:30 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: RE: Unending Leader Elections in WAN deploy
 
  Ok, I'll apply that patch and report back.
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 7:18 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  You're missing 491 from your set of patches.
 
  -Flavio
 
  On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
 
  This repro's in both branch-3.2, and branch-3.2+patches(473,
479,
  481).
 
  Basically, it seems like the nodes are electing pd4-zook02 to
be
  the
  leader. However, pd4-zook02 seems to realize it's not supposed
to
  be
  and
  then disconnects everyone. Then they re-elect it again, and it
  loops
  over and over.
 
  -
  Server config
  -
 
  server.1=dc1-zook01.dc01.revsci.net:2888:3888
  server.2=dc1-zook02.dc01.revsci.net:2888:3888
  server.3=dc1-zook03.dc01.revsci.net:2888:3888
  server.4=dc1-zook04.dc01.revsci.net:2888:3888
  server.5=dc1-zook05.dc01.revsci.net:2888:3888
  server.6=pd1-zook01.pd01.revsci.net:2888:3888
  server.7=pd1-zook02.pd01.revsci.net:2888:3888
  server.8=pd4-zook01.iad1.audsci.net:2888:3888
  server.9=pd4-zook02.iad1.audsci.net:2888:3888
 
  group.1:1:2:3:4:5
  weight.1=1
  weight.2=1
  weight.3=1
  weight.4=1
  weight.5=1
 
  group.2:6:7:8:9
  weight.6=0
  weight.7=0
  weight.8=0
  weight.9=0
 
  Note that we have 2 groups, composed of machines in 3 different
  locations (dc1, pd1, and pd4). The idea is that only machines
in
  dc1
  have voting rights, and the ability to become a leader. The
  machines

RE: Unending Leader Elections in WAN deploy

2009-08-03 Thread Todd Greenwood
Another problem...I've reverted to the latest versions of the patches
that are not specific to branch-3.2, and I'm getting two compilation
errors:

build-generated:
[javac] Compiling 44 source files to
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
atched/branch-3.2/build/classes

compile-main:
[javac] Compiling 2 source files to
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
atched/branch-3.2/build/classes
[javac]
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have
the same erasure
[javac] public String[] getQuorumPeers();
[javac] ^
[javac]
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
mStats.java:31: name clash: getServerState() and getServerState() have
the same erasure
[javac] public String getServerState();
[javac]   ^
[javac] 2 errors

My build process is pretty simple:

1. copy the branch-3.2 source to a temp directory
(src/patched/branch-3.2)
2. apply the ZOOKEEPER patches in my patches directory
3. build zookeeper in the temp directory

-Todd
 -Original Message-
 From: Todd Greenwood [mailto:to...@audiencescience.com]
 Sent: Monday, August 03, 2009 4:09 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: RE: Unending Leader Elections in WAN deploy
 
 Flavio,
 I notice that you've updated the patches referenced for the WAN
 deployment. There appears to be an order dependency w/ respect to
these
 four patches...
 
 ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
 ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
 
 473 -> 479 (479 fails)
 

to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
 /src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
 patching file

src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
 ical.java
 patching file

src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
 patching file

src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
 .java
 patching file
 src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
 Hunk #1 FAILED at 93.
 Hunk #2 FAILED at 145.
 2 out of 2 hunks FAILED -- saving rejects to file

src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej

to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
 /src/patched/branch-3.2$ h ../patches/
 
 Could you advise as to which patches I need to apply, and in what
order?
 
 -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 9:51 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  Perfect! Thanks for the update, Todd.
 
  -Flavio
 
  On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
 
   Thanks. You were right, I had a stale version of 479. Compilation
   succeeds and all tests pass on branch-3.2 with the latest patches
 473,
   479, 481, and 491.
  
   -Todd
  
   -Original Message-
   From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
   Sent: Friday, July 31, 2009 7:48 PM
   To: zookeeper-user@hadoop.apache.org
   Subject: Re: Unending Leader Elections in WAN deploy
  
   It should be in 479. Perhaps you have a stale version of the
patch.
  
   -Flavio
  
   On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
  
   Flavio,
  
   I'm getting a compilation error for patch 491:
  
   compile-main:
 [javac] Compiling 1 source file to
  
 /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
   src/p
   atched/branch-3.2/build/classes
 [javac]
  
 /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
   src/p
  
 atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
   FastL
   eaderElection.java:601: cannot find symbol
 [javac] symbol  : method getWeight(long)
 [javac] location: interface
   org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
 [javac]
   if(self.getQuorumVerifier().getWeight(n.sid) != 0)
 [javac]^
 [javac] 1 error
  
   I see a reference to getWeight in both FastLeaderElection.java
in
   patch
   491:
  
   patches/ZOOKEEPER-491.patch:+
   if(self.getQuorumVerifier().getWeight(n.sid) != 0)
   src/java/main/org/apache/zookeeper/server/quorum/
   FastLeaderElection.java
   :
   if(self.getQuorumVerifier().getWeight(n.sid) !=
   0)
  
   However, I don't see a reference to this method in patches 473,
 479,
   or
   481. I also don't see a reference to this method in the trunk...
  
   -Todd
  
   -Original Message-
   From: Todd Greenwood [mailto:to...@audiencescience.com

RE: Unending Leader Elections in WAN deploy

2009-08-03 Thread Todd Greenwood
That'd be perfect. Thanks!

 -Original Message-
 From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
 Sent: Monday, August 03, 2009 4:24 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Unending Leader Elections in WAN deploy
 
 Hi Todd,
   Most of the patches that you mention should be in the branch 3.2 by
tomm
 or so. 481, 479 are already in. 480 and 491 should be in by tomm.
Would
 that
 suffice for you?
 
 Thanks
 mahadev
 
 
 On 8/3/09 4:21 PM, Todd Greenwood to...@audiencescience.com wrote:
 
  Another problem...I've reverted to the latest versions of the
patches
  that are not specific to branch-3.2, and I'm getting two compilation
  errors:
 
  build-generated:
  [javac] Compiling 44 source files to
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
  atched/branch-3.2/build/classes
 
  compile-main:
  [javac] Compiling 2 source files to
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
  atched/branch-3.2/build/classes
  [javac]
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
 
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
  mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers()
have
  the same erasure
  [javac] public String[] getQuorumPeers();
  [javac] ^
  [javac]
 
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
 
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
  mStats.java:31: name clash: getServerState() and getServerState()
have
  the same erasure
  [javac] public String getServerState();
  [javac]   ^
  [javac] 2 errors
 
  My build process is pretty simple:
 
  1. copy the branch-3.2 source to a temp directory
  (src/patched/branch-3.2)
  2. apply the ZOOKEEPER patches in my patches directory
  3. build zookeeper in the temp directory
 
  -Todd
  -Original Message-
  From: Todd Greenwood [mailto:to...@audiencescience.com]
  Sent: Monday, August 03, 2009 4:09 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: RE: Unending Leader Elections in WAN deploy
 
  Flavio,
  I notice that you've updated the patches referenced for the WAN
  deployment. There appears to be an order dependency w/ respect to
  these
  four patches...
 
  ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
  ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
 
  473 -> 479 (479 fails)
 
 
 
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
  /src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
  patching file
 
 
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
  ical.java
  patching file
 
 
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
  patching file
 
 
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
  .java
  patching file
  src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
  Hunk #1 FAILED at 93.
  Hunk #2 FAILED at 145.
  2 out of 2 hunks FAILED -- saving rejects to file
 
 
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
 
 
to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
  /src/patched/branch-3.2$ h ../patches/
 
  Could you advise as to which patches I need to apply, and in what
  order?
 
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 9:51 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  Perfect! Thanks for the update, Todd.
 
  -Flavio
 
  On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
 
  Thanks. You were right, I had a stale version of 479. Compilation
  succeeds and all tests pass on branch-3.2 with the latest patches
  473,
  479, 481, and 491.
 
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 7:48 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  It should be in 479. Perhaps you have a stale version of the
  patch.
 
  -Flavio
 
  On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
 
  Flavio,
 
  I'm getting a compilation error for patch 491:
 
  compile-main:
[javac] Compiling 1 source file to
 
  /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p
  atched/branch-3.2/build/classes
[javac]
 
  /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p
 
  atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
  FastL
  eaderElection.java:601: cannot find symbol
[javac] symbol  : method getWeight(long)
[javac] location: interface
  org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
[javac]
  if(self.getQuorumVerifier().getWeight(n.sid) != 0)
[javac

RE: test failures in branch-3.2

2009-07-31 Thread Todd Greenwood
Patrick,
Thank you for the background (and I hope you and Mahadev recover
quickly).

On a plus note, I'm finding that this morning, @work rather than @home,
the tests continue to completion. However, there are other issues that
I'll bring up on the dev list, such as a requirement to have autoconf
installed, and problems in the create-cppunit-configure task that can't
exec libtoolize, fun stuff like that.

I need to proceed with the manual patches to branch-3.2, as I am under
some time constraints to get our infrastructure deployed such that QA
can start playing with it. However, I'll switch to 3.2.1 as soon as I
can.

-Todd

 -Original Message-
 From: Patrick Hunt [mailto:ph...@apache.org]
 Sent: Friday, July 31, 2009 11:38 AM
 To: zookeeper-user@hadoop.apache.org; Todd Greenwood
 Subject: Re: test failures in branch-3.2
 
 Hi Todd,
 
 Sorry for the clutter/confusion. Usually things aren't this cumbersome
;-)
 
 In particular:
1 committer is on vacation
Mahadev's been out sick for multiple days
I'm sick but trying to hang in there, but def not 100%
 
 Hudson (CI) has been offline for effectively the past 3 weeks (that
 gates all our commits) and is just now back but flaky.
 
 3.2 had some bugs that we are trying to address, but the afore
mentioned
 issues are slowing us down. Otw we'd have all this straightened out by
 now 
 
 At this point you should move this discussion to the dev list - Apache
 doesn't really like us to discuss code changes/futures here (user
list).
 On that list you'll also see the plan for upcoming releases - I
mention
 all this because we are actively working toward 3.2.1 which will
include
 the JIRAs slated for that release (I'm sure you've seen).
 
 If you can wait a bit you might be able to avoid some pain by using
the
 upcoming 3.2.1 release. Once the patches land into that branch your
 issues will be resolved w/o you needing to manually apply patches,
etc...
 
 
 I did look at the files you attached - it looks fine so I'm not sure
the
 issue. The form of this test makes it harder - we are verifying that
the
 log contains sufficient information when a particular error occurs. We
 fiddle with log4j in order to do this, which means that the log you
are
 including doesn't specify the problem.
 
 Try instrumenting this test with a try/catch around the content of the
 test method (all the code in the failing method inside a big try/catch
 is what I mean). Then print the error to std out as part of the catch.
 That should shed some light. If you could debug it a bit that would
help
 - because we aren't seeing this in our environment.
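
A minimal sketch of the try/catch instrumentation Patrick describes; the test class,
annotation style, and method body are placeholders rather than the actual
QuorumPeerMainTest code:

import org.junit.Test;

public class InstrumentedQuorumTest {
    @Test
    public void testBadPeerAddressInQuorum() throws Exception {
        try {
            // ... original body of the failing test method goes here ...
        } catch (Throwable t) {
            // Print the real failure to stdout so it survives the log4j
            // fiddling the test does, then rethrow so JUnit still fails.
            t.printStackTrace(System.out);
            throw new RuntimeException(t);
        }
    }
}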
 
 Again, sort of a moot point if you can wait a week or so...
 
 Regards,
 
 Patrick
 
 Todd Greenwood wrote:
  Inline.
 
  -Original Message-
  From: Patrick Hunt [mailto:ph...@apache.org]
  Sent: Thursday, July 30, 2009 10:57 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: test failures in branch-3.2
 
  Todd Greenwood wrote:
  Starting w/ branch-3.2 (no changes) I applied patches in this
order:
 
  1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest
  fails.
  2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file
-
  PortAssignment.java.
 
  PortAssignment.java was added by Patrick as part of
  ZOOKEEPER-473.patch,
  which is a pretty hefty patch (> 2k lines) and touches a large
  number of
  files.
  Hrm, those patches were probably created against the trunk. We'll
have
  to have separate patches for trunk and 3.2 branch on 481.
 
  If you could update the jira with this detail (481 needs two
patches,
  one for each branch) that would be great!
 
 
  Done.
 
  3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails
  (jvm
  crashes).
  473 is special (unique) in the sense that it changes log4j while
the
  the vm is running. In general though it's a pretty boring test and
  shouldn't be failing.
 
  Are you sure you have the right patch file? there are 2 patch files
on
  the JIRA for 473, make sure that you have the one from 7/16, NOT
the
  one
  from 7/15. Check that the patch file, the correct one should NOT
  contain
  changes to build.xml or conf/log4j* files. If this still happens
send
  me
  your build.xml, conf/log4j* and QuroumPeerMainTest.java files in
email
  for review. I'll take a look.
 
 
 
  I've annotated the files w/ their date while downloading:
  112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
  110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
 
  It appears I applied the 7-16 patch, as that is the matching file
size
  of the patch file I applied.
 
  If there are to be multiple patch files for multiple branches (3.2,
  trunk, etc.), would it make sense to label the patch files accordingly?
 
  Requested files in attached tar.
 
  -Todd
 
  Patrick
 
 
  [junit] Running
  org.apache.zookeeper.server.quorum.QuorumPeerMainTest
  [junit] Running
  org.apache.zookeeper.server.quorum.QuorumPeerMainTest
  [junit] Tests run: 1, Failures: 0, Errors: 1, Time

Unending Leader Elections in WAN deploy

2009-07-31 Thread Todd Greenwood
This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481). 

Basically, it seems like the nodes are electing pd4-zook02 to be the
leader. However, pd4-zook02 seems to realize it's not supposed to be and
then disconnects everyone. Then they re-elect it again, and it loops
over and over.

-
Server config
-

server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888

group.1:1:2:3:4:5   
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1

group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0

Note that we have 2 groups, composed of machines in 3 different
locations (dc1, pd1, and pd4). The idea is that only machines in dc1
have voting rights, and the ability to become a leader. The machines in
the pods all have a weight of zero, and are not expected to become
leaders, or to vote on transactions.

Let me know what I can do to help resolve this issue.

-Todd


RE: Unending Leader Elections in WAN deploy

2009-07-31 Thread Todd Greenwood
Ok, I'll apply that patch and report back.
-Todd

 -Original Message-
 From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
 Sent: Friday, July 31, 2009 7:18 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Unending Leader Elections in WAN deploy
 
 You're missing 491 from your set of patches.
 
 -Flavio
 
 On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
 
  This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
  481).
 
  Basically, it seems like the nodes are electing pd4-zook02 to be the
  leader. However, pd4-zook02 seems to realize it's not supposed to be
  and
  then disconnects everyone. Then they re-elect it again, and it loops
  over and over.
 
  -
  Server config
  -
 
  server.1=dc1-zook01.dc01.revsci.net:2888:3888
  server.2=dc1-zook02.dc01.revsci.net:2888:3888
  server.3=dc1-zook03.dc01.revsci.net:2888:3888
  server.4=dc1-zook04.dc01.revsci.net:2888:3888
  server.5=dc1-zook05.dc01.revsci.net:2888:3888
  server.6=pd1-zook01.pd01.revsci.net:2888:3888
  server.7=pd1-zook02.pd01.revsci.net:2888:3888
  server.8=pd4-zook01.iad1.audsci.net:2888:3888
  server.9=pd4-zook02.iad1.audsci.net:2888:3888
 
  group.1:1:2:3:4:5
  weight.1=1
  weight.2=1
  weight.3=1
  weight.4=1
  weight.5=1
 
  group.2:6:7:8:9
  weight.6=0
  weight.7=0
  weight.8=0
  weight.9=0
 
  Note that we have 2 groups, composed of machines in 3 different
  locations (dc1, pd1, and pd4). The idea is that only machines in dc1
  have voting rights, and the ability to become a leader. The machines
  in
  the pods all have a weight of zero, and are not expected to become
  leaders, or to vote on transactions.
 
  Let me know what I can do to help resolve this issue.
 
  -Todd



RE: Unending Leader Elections in WAN deploy

2009-07-31 Thread Todd Greenwood
Thanks. You were right, I had a stale version of 479. Compilation
succeeds and all tests pass on branch-3.2 with the latest patches 473,
479, 481, and 491.

-Todd
 
 -Original Message-
 From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
 Sent: Friday, July 31, 2009 7:48 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Unending Leader Elections in WAN deploy
 
 It should be in 479. Perhaps you have a stale version of the patch.
 
 -Flavio
 
 On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
 
  Flavio,
 
  I'm getting a compilation error for patch 491:
 
  compile-main:
 [javac] Compiling 1 source file to
  /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p
  atched/branch-3.2/build/classes
 [javac]
  /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
  src/p
  atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
  FastL
  eaderElection.java:601: cannot find symbol
 [javac] symbol  : method getWeight(long)
 [javac] location: interface
  org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
 [javac]
  if(self.getQuorumVerifier().getWeight(n.sid) != 0)
 [javac]^
 [javac] 1 error
 
  I see a reference to getWeight in both FastLeaderElection.java in
  patch
  491:
 
  patches/ZOOKEEPER-491.patch:+
  if(self.getQuorumVerifier().getWeight(n.sid) != 0)
  src/java/main/org/apache/zookeeper/server/quorum/
  FastLeaderElection.java
  :
  if(self.getQuorumVerifier().getWeight(n.sid) !=
  0)
 
  However, I don't see a reference to this method in patches 473, 479,
  or
  481. I also don't see a reference to this method in the trunk...
 
  -Todd
 
  -Original Message-
  From: Todd Greenwood [mailto:to...@audiencescience.com]
  Sent: Friday, July 31, 2009 7:30 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: RE: Unending Leader Elections in WAN deploy
 
  Ok, I'll apply that patch and report back.
  -Todd
 
  -Original Message-
  From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
  Sent: Friday, July 31, 2009 7:18 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: Unending Leader Elections in WAN deploy
 
  You're missing 491 from your set of patches.
 
  -Flavio
 
  On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
 
  This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
  481).
 
  Basically, it seems like the nodes are electing pd4-zook02 to be
  the
  leader. However, pd4-zook02 seems to realize it's not supposed to
  be
  and
  then disconnects everyone. Then they re-elect it again, and it
  loops
  over and over.
 
  -
  Server config
  -
 
  server.1=dc1-zook01.dc01.revsci.net:2888:3888
  server.2=dc1-zook02.dc01.revsci.net:2888:3888
  server.3=dc1-zook03.dc01.revsci.net:2888:3888
  server.4=dc1-zook04.dc01.revsci.net:2888:3888
  server.5=dc1-zook05.dc01.revsci.net:2888:3888
  server.6=pd1-zook01.pd01.revsci.net:2888:3888
  server.7=pd1-zook02.pd01.revsci.net:2888:3888
  server.8=pd4-zook01.iad1.audsci.net:2888:3888
  server.9=pd4-zook02.iad1.audsci.net:2888:3888
 
  group.1:1:2:3:4:5
  weight.1=1
  weight.2=1
  weight.3=1
  weight.4=1
  weight.5=1
 
  group.2:6:7:8:9
  weight.6=0
  weight.7=0
  weight.8=0
  weight.9=0
 
  Note that we have 2 groups, composed of machines in 3 different
  locations (dc1, pd1, and pd4). The idea is that only machines in
  dc1
  have voting rights, and the ability to become a leader. The
  machines
  in
  the pods all have a weight of zero, and are not expected to
become
  leaders, or to vote on transactions.
 
  Let me know what I can do to help resolve this issue.
 
  -Todd
 



RE: Zookeeper WAN Configuration

2009-07-30 Thread Todd Greenwood
Patrick - Thank you, I'll proceed accordingly. -Todd

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Wednesday, July 29, 2009 10:30 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

 [Todd] What is the recommended policy regarding patching zookeeper
 locally? As an external user, should I patch and compile in the trunk
or
 in the branch (branch-3.2)? 
 
 I've looked at :
 http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
 http://wiki.apache.org/hadoop/HowToRelease
 
 And both of these seem well thought out but aimed at commiters
commiting
 to the trunk. 
 

In your context (want 3.2 features) you probably want to build based on 
the 3.2 tag, that way you are working off a known quantity. I'd suggest 
strongly that as part of your build you document the source base and 
which patches/changes you have applied. Having this information will be 
critical for you (or someone using your build) in case bugs have to be 
filed, or further changes/patches have to be applied, etc...

Patrick


RE: bad svn url : test-patch

2009-07-30 Thread Todd Greenwood
Thanks Mahadev.

-Original Message-
From: Mahadev Konar [mailto:maha...@yahoo-inc.com] 
Sent: Thursday, July 30, 2009 3:00 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: bad svn url : test-patch

Hi Todd,
  Yes this happens with the branch 3.2. The test-patch link is broken
because of the hadoop split. This file is used for the Hudson test
environment. It isn't used anywhere else, so the svn co otherwise should
be fine. We should fix it anyways.

Thanks
mahadev


On 7/30/09 2:57 PM, Todd Greenwood to...@audiencescience.com wrote:

 FYI - looks like there is a bad url in svn...
 
 $ svn co
 http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
 branch-3.2
 
 ...
 A    branch-3.2/build.xml
 
 Fetching external item into 'branch-3.2/src/java/test/bin'
 svn: URL
 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
 doesn't exist
 
 This does not repro w/ 3.1:
 
 $ svn co
 http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.1
 branch-3.1
 
 -Todd
 



RE: test failures in branch-3.2

2009-07-30 Thread Todd Greenwood
No edits to conf/log4j.properties.

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Thursday, July 30, 2009 9:25 PM
To: Patrick Hunt
Cc: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

btw QuorumPeerMainTest uses the CONSOLE appender which is setup in 
conf/log4j.properties, now that I think of it perhaps not such a good 
idea :-)

If you edited conf/log4j.properties it may be causing the test to fail, 
did you do this? (if you run the test by itself using -Dtestcase does it

always fail?)

I've entered a jira to address this:
https://issues.apache.org/jira/browse/ZOOKEEPER-492

Patrick

Patrick Hunt wrote:
 Todd Greenwood wrote:
 The build succeeds, but not all of the tests. In previous test
runs,
 I noticed an error in org.apache.zookeeper.test.FLETest. It was not
able
 to bind to a port or something. Now, after a machine reboot, I'm
getting
 different failures. 
 
 address in use? That's a problem in the test framework pre-3.3. In
3.3 
 (current svn trunk) I fixed it but it's not in 3.2.x. This is a
problem 
 with the test framework though and not a real problem, it shows up 
 occasionally (depends on timing).
 
 branch-3.2 $ ant test

 [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
 FAILED (crashed)
 [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED

 Test logs for these two tests attached.
 
 This is unusual though - looking at the log it seems that the JVM
itself 
 crashed for the QPMainTest! for HQT we are seeing:
 
 junit.framework.AssertionFailedError: Threads didn't join
 
 which Flavio mentioned to me once is possible to happen but not a real

 problem (he can elaborate).
 
 What version of java are you using? OS, other environment that might
be 
 interesting? (vm? etc...) You might try looking at the jvm crash dump 
 file (I think it's in /tmp)
 
 If you run each of these two tests individually do they run? example:
 ant -Dtestcase=FLENewEpochTest test-core-java
 
 My goal here is to get to a known state (all tests succeeding or have
 workarounds for the failures). Following that, I plan to apply the
 patches Flavio recommended for a WAN deploy (479 and 481). After I
 verify that the tests continue to run, I'll package this up and
deploy
 it to our WAN for testing. 
 
 Sounds like a good plan.
 
 So, are these known issues? Do the tests normally run en masse, or do
 some of the tests hold on to resources and prevent other tests from
 passing?
 
 Typically they do run to completion, but occasionally on my machine 
 (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
 random failure due to address in use, or the same didn't join that
you 
 saw. Usually I see this if I'm multitasking (vs just letting the tests

 run w/o using the box). As I said this is addressed in 3.3 (address 
 reuse at the very least, and I haven't see the other issues).
 
 Patrick
 
 


RE: test failures in branch-3.2

2009-07-30 Thread Todd Greenwood
Patrick, inline.

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Thursday, July 30, 2009 9:13 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
 The build succeeds, but not all of the tests. In previous test
runs,
 I noticed an error in org.apache.zookeeper.test.FLETest. It was not
able
 to bind to a port or something. Now, after a machine reboot, I'm
getting
 different failures. 

address in use? That's a problem in the test framework pre-3.3. In 3.3

(current svn trunk) I fixed it but it's not in 3.2.x. This is a problem 
with the test framework though and not a real problem, it shows up 
occasionally (depends on timing).

[Todd] Yes, I believe address in use was the problem w/ FLETest. I
assumed it was a timing issue w/ respect to test A not fully releasing
resources before test B started.

 branch-3.2 $ ant test
 
 [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
 FAILED (crashed)
 [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
 
 Test logs for these two tests attached.

This is unusual though - looking at the log it seems that the JVM itself

crashed for the QPMainTest! for HQT we are seeing:

junit.framework.AssertionFailedError: Threads didn't join

which Flavio mentioned to me once is possible to happen but not a real 
problem (he can elaborate).

What version of java are you using? OS, other environment that might be 
interesting? (vm? etc...) You might try looking at the jvm crash dump 
file (I think it's in /tmp)

[Todd] ---
$ uname -a
Linux TODDG01LT 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC
2009 x86_64 GNU/Linux

$ which java
/home/toddg/bin/x64/java/jdk1.6.0_13/bin/java

$ java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

Memory = 4GB
[Todd] ---

If you run each of these two tests individually do they run? example:
ant -Dtestcase=FLENewEpochTest test-core-java

[Todd] Will try this once my local build is working and report back.
I'll open a separate mail thread on applying patches.

 My goal here is to get to a known state (all tests succeeding or have
 workarounds for the failures). Following that, I plan to apply the
 patches Flavio recommended for a WAN deploy (479 and 481). After I
 verify that the tests continue to run, I'll package this up and deploy
 it to our WAN for testing. 

Sounds like a good plan.

 So, are these known issues? Do the tests normally run en masse, or do
 some of the tests hold on to resources and prevent other tests from
 passing?

Typically they do run to completion, but occasionally on my machine 
(java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
random failure due to address in use, or the same didn't join that you

saw. Usually I see this if I'm multitasking (vs just letting the tests 
run w/o using the box). As I said this is addressed in 3.3 (address 
reuse at the very least, and I haven't see the other issues).

Patrick




RE: test failures in branch-3.2

2009-07-30 Thread Todd Greenwood
Patrick/Flavio -

Starting w/ branch-3.2 (no changes) I applied patches in this order:

1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
PortAssignment.java.

PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
which is a pretty hefty patch (> 2k lines) and touches a large number of
files. 

3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
crashes).

[junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
[junit] Running
org.apache.zookeeper.server.quorum.QuorumPeerMainTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
FAILED (crashed)


Test Log

Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec 

Testcase: testBadPeerAddressInQuorum took 0.004 sec 
Caused an ERROR
Forked Java VM exited abnormally. Please note the time in the report
does not reflect the time until the VM exit.
junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
Please note the time in the report does not reflect the time until the
VM exit.

-Todd

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Thursday, July 30, 2009 10:13 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
 
 [Todd] Yes, I believe address in use was the problem w/ FLETest. I
 assumed it was a timing issue w/ respect to test A not fully releasing
 resources before test B started.

Might be, but actually I think it's related to this:
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html

Patrick
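
For background on the linked address-in-use article, a small illustration of the
SO_REUSEADDR socket option it discusses; this is generic java.net usage, not ZooKeeper
test code, and the port and backlog values are arbitrary:

import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddrExample {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket();
        // Allow binding while an older socket on the same port is still in
        // TIME_WAIT, which is the "address in use" situation the article describes.
        server.setReuseAddress(true);
        server.bind(new InetSocketAddress(2181, 50));
        System.out.println("bound: " + server.getLocalSocketAddress());
        server.close();
    }
}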


Zookeeper WAN Configuration

2009-07-24 Thread Todd Greenwood
Like most folks, our WAN is composed of various zones, some central
processing, some edge, some corp, and some in between (DMZs). In this
model, a given Zookeeper server will not have direct connectivity to all
of its peers in the ensemble due to various security constraints. Is
this a problem? Are there special configurations for this model?

Given 3 Zones
-

A <--> B
B <--> C

A cannot see C, and vice versa.
B can see A and C.

1. Will zookeeper servers function properly even if a given set of
servers can only see some of the servers in the ensemble? For example,
the shared config lists all zk servers in A, B, and C, but A can only
see B, C can only see B, and B can see both A and C.

2. Will zookeeper servers flood the log with error messages if only a
subset of the ensemble members are visible?

3. Will the zk ensemble function properly if the config used by each
server only lists the servers in the ensemble that are visible? Suppose
that A has a config that only list servers in A and B, C a config for C
and B, and B has a config that lists servers in A, B, and C. Is this the
recommended approach?

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html


RE: Zookeeper WAN Configuration

2009-07-24 Thread Todd Greenwood
Flavio & Ted, thank you for your comments.

So it sounds like the only way to currently deploy to the WAN is to
deploy ZK Servers to the central DC and open up client connections to
these ZK servers from the edge nodes. True?

In the future, once the Observers feature is implemented, then we should
be able to deploy zk servers to both the DC and to the pods...with all
the goodness that Flavio mentions below.

Flavio - do you have a doc that describes exactly what happens in the
transaction of a write operation? For instance, I'd like to know at
exactly what stage a write has been commited to the ensemble, and not
just the zk server the client is connected to. I figure it must be
something like:

clientA.write(path, value)
- serverA writes to memory
- serverA writes to transacted disk every n/seconds or m/bytes
- serverA sends write to Leader
- Leader stamps with transaction id
- Leader responds to ensemble with update + transaction id
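
One data point that may help while waiting for a proper doc: with the
synchronous client API, a write call does not return success until the
leader has committed the operation across a quorum, so the return of the
call itself marks the "committed to the ensemble" point (as I understand
the protocol). A minimal sketch; the connect string and znode path are
invented for illustration:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class SyncWriteExample {
    public static void main(String[] args) throws Exception {
        // Connect string is an example; any reachable server in the ensemble works.
        ZooKeeper zk = new ZooKeeper("zk-b1.example.com:2181", 30000,
                new Watcher() {
                    public void process(WatchedEvent event) { /* no-op */ }
                });

        // The synchronous create() only returns success after the leader has
        // committed the operation across a quorum, so at this point the write
        // is durable on the ensemble, not just cached by the connected server.
        String path = zk.create("/wan-demo", "value".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println("committed: " + path);

        zk.close();
    }
}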

-Todd

-Original Message-
From: Flavio Junqueira [mailto:f...@yahoo-inc.com] 
Sent: Friday, July 24, 2009 4:50 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Just a few quick observations:

On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:

 On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
 to...@audiencescience.com wrote:

 Could you explain the idea behind the Observers feature, what this
 concept is supposed to address, and how it applies to the WAN
 configuration problem in particular?


 Not really.  I am just echoing comments on observers from them that know.


Without observers, increasing the number of servers in an ensemble  
enables higher read throughput, but causes write throughput to drop  
because the number of votes to order each write operation increases.  
Essentially, observers are zookeeper servers that don't vote when  
ordering updates to the zookeeper state. Adding observers enables  
higher read throughput while minimally affecting write throughput (the  
leader still has to send commits to everyone, at least in the version we  
have been working on).


 
 The ideas for federating ZK or allowing observers would likely do what you
 want.  I can imagine that an observer would only care that it can see its
 local peers and one of the observers would be elected to get updates (and
 thus would care about the central service).
 
 This certainly sounds like exactly what I want...Was this  
 introduced in
 3.2 in full, or only partially?


 I don't think it is even in trunk yet.  Look on Jira or at the  
 recent logs
 of this mailing list.

It is not on trunk yet.

-Flavio



RE: Leader Elections

2009-07-20 Thread Todd Greenwood
Flavio, Ted, Henry, Scott, this would work perfectly well for my use case
provided:

SINGLE ENSEMBLE:
GROUP A : ZK Servers w/ read/write AND Leader Elections
GROUP B : ZK Servers w/ read/write W/O Leader Elections

So, we can craft this via Observers and Hierarchical Quorum groups?
Great. Problem solved.

When will this be production ready? :o)
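
For reference, a sketch of how such a split can be written down in zoo.cfg.
The group/weight keys are the existing hierarchical quorum support; the
:observer suffix and peerType key follow the syntax used by the observers
work, which is not yet released; hostnames are invented:

# Voting servers (GROUP A) - take part in leader election and quorums
server.1=zk-a1.example.com:2888:3888
server.2=zk-a2.example.com:2888:3888
server.3=zk-a3.example.com:2888:3888

# Non-voting servers (GROUP B) - serve reads, forward writes, never vote
server.4=zk-b1.example.com:2888:3888:observer
server.5=zk-b2.example.com:2888:3888:observer

# Each GROUP B server additionally sets in its own zoo.cfg:
peerType=observer

# Hierarchical quorums, if finer control over the voters is wanted:
# group.1=1:2:3
# weight.1=1
# weight.2=1
# weight.3=1

As Henry notes further down the thread, observers alone may be enough here;
the group/weight lines matter only if you want to shape how the voters
themselves form a quorum.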



Scott brought up a multi-feature that is very interesting for me.
Namely:

1. Offline ZK servers that sync & merge on reconnect

The offline servers idea seems conceptually simple; it's kind of like a
messaging system. However, the merge and resolve step when two servers
reconnect might be challenging. Cool idea though.

2. Partial memory graph subscriptions

The second idea is partial memory graph subscriptions. This would enable
virtual ensembles to interact on the same physical ensemble. For my use
case, this would prevent unnecessary cross talk between nodes on a WAN,
allowing me to define the subsets of the memory graph that need to be
replicated, and to whom. This would be a huge scalability win for WAN
use cases.

-Todd

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com] 
Sent: Monday, July 20, 2009 11:00 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Leader Elections

Observers would be awesome especially with a couple enhancements /
extensions:

An option for the observers to enter a special state if the WAN link
goes down to the master cluster.  A read-only option would be great.
However, allowing certain types of writes to continue on a limited basis
would be highly valuable as well.  An observer could own a special
node and its subnodes.  Only these subnodes would be writable by the
observer when there was a session break to the master cluster, and the
master cluster would take all the changes when the link is
reestablished.  Essentially, it is a portion of the hierarchy that is
writable only by a specific observer, and read-only for others.
The purpose of this would be for when the WAN link goes down to the
master ZKs for certain types of use cases - status updates or other
changes local to the observer that are strictly read-only outside the
Observer's 'realm'.


On 7/19/09 12:16 PM, Henry Robinson he...@cloudera.com wrote:

You can. See ZOOKEEPER-368 - at first glance it sounds like observers will
be a good fit for your requirements.

Do bear in mind that the patch on the jira is only for discussion purposes;
I would not consider it currently fit for production use. I hope to put up a
much better patch this week.

Henry

On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

 Can you submit updates via an observer?

 On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira f...@yahoo-inc.com
 wrote:

  2- Observers: you could have one computing center containing an ensemble
  and observers around the edge just learning committed values.




 --
 Ted Dunning, CTO
 DeepDyve




RE: Leader Elections

2009-07-20 Thread Todd Greenwood
Henry, cool. When your patch is ready for testing, I'll devote some
time to take a test pass on it.

-Original Message-
From: Henry Robinson [mailto:he...@cloudera.com] 
Sent: Monday, July 20, 2009 2:54 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Leader Elections

On Mon, Jul 20, 2009 at 7:50 PM, Todd Greenwood
to...@audiencescience.com wrote:

 Flavio, Ted, Henry, Scott, this would work perfectly well for my use case
 provided:

 SINGLE ENSEMBLE:
GROUP A : ZK Servers w/ read/write AND Leader Elections
GROUP B : ZK Servers w/ read/write W/O Leader Elections

 So, we can craft this via Observers and Hierarchical Quorum groups?
 Great. Problem solved.

 When will this be production ready? :o)


Looks to me like you don't even need hierarchical quorums for this - make
everyone in group B an Observer and you're done.

I've been working on this feature. Recently we've been discussing a
proof-of-concept patch on the JIRA. I have nearly finished a less rough
patch which I will submit for discussion and potentially commit this week.
At that point it would be extremely helpful if you could help test the
patch, and you can start considering it for production. To get into trunk I
will have to write a comprehensive test suite and update the documentation,
and then making sure all the boxes are ticked and no regressions are thrown
up can take a little while.

Henry





 

 Scott brought up a multi-feature that is very interesting for me.
 Namely:

 1. Offline ZK servers that sync & merge on reconnect

 The offline servers idea seems conceptually simple; it's kind of like a
 messaging system. However, the merge and resolve step when two servers
 reconnect might be challenging. Cool idea though.

 2. Partial memory graph subscriptions

 The second idea is partial memory graph subscriptions. This would enable
 virtual ensembles to interact on the same physical ensemble. For my use
 case, this would prevent unnecessary cross talk between nodes on a WAN,
 allowing me to define the subsets of the memory graph that need to be
 replicated, and to whom. This would be a huge scalability win for WAN
 use cases.

 -Todd

 -Original Message-
 From: Scott Carey [mailto:sc...@richrelevance.com]
 Sent: Monday, July 20, 2009 11:00 AM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Leader Elections

 Observers would be awesome especially with a couple enhancements /
 extensions:

 An option for the observers to enter a special state if the WAN link
 goes down to the master cluster.  A read-only option would be great.
 However, allowing certain types of writes to continue on a limited basis
 would be highly valuable as well.  An observer could own a special
 node and its subnodes.  Only these subnodes would be writable by the
 observer when there was a session break to the master cluster, and the
 master cluster would take all the changes when the link is
 reestablished.  Essentially, it is a portion of the hierarchy that is
 writable only by a specific observer, and read-only for others.
 The purpose of this would be for when the WAN link goes down to the
 master ZKs for certain types of use cases - status updates or other
 changes local to the observer that are strictly read-only outside the
 Observer's 'realm'.


 On 7/19/09 12:16 PM, Henry Robinson he...@cloudera.com wrote:

 You can. See ZOOKEEPER-368 - at first glance it sounds like observers will
 be a good fit for your requirements.

 Do bear in mind that the patch on the jira is only for discussion purposes;
 I would not consider it currently fit for production use. I hope to put up a
 much better patch this week.

 Henry

 On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Can you submit updates via an observer?
 
  On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira f...@yahoo-inc.com
  wrote:
 
   2- Observers: you could have one computing center containing an ensemble
   and observers around the edge just learning committed values.
 
 
 
 
  --
  Ted Dunning, CTO
  DeepDyve